Job: Internship Announcement (4 months), Large-Scale Socio-Semantic Corpus Indexing And Analysis
Thierry Hamon
thierry.hamon at UNIV-PARIS13.FR
Sun Jun 30 17:36:38 UTC 2013
Date: Fri, 28 Jun 2013 12:01:18 +0200 (CEST)
From: Pascale Sebillot <pascale.sebillot at irisa.fr>
Message-ID: <717388927.3792745.1372413678721.JavaMail.root at irisa.fr>
Internship Announcement (4 months)
LARGE-SCALE SOCIO-SEMANTIC CORPUS INDEXING AND ANALYSIS
CNRS MASTODONS / ARESOS
ISC-PIF (Paris), LIP6 (Paris), IRISA (Rennes)
In the context of the ARESOS project funded by the French national CNRS
action MASTODONS (Grandes masses de données scientifiques,
http://www.cnrs.fr/mi/spip.php?article53 ), we are proposing a paid
internship in the area of large-scale content indexing and analysis. The
internship runs for 4 months, from September 1 to December 31, 2013.
Participating Labs : ISC-PIF (Paris), LIP6 (Paris), IRISA (Rennes)
Location : The work will be carried out in Paris and/or Rennes (open for
discussion), with some short visits to the other laboratories.
Funding : CNRS Mastodons ( http://mastodons.lip6.fr/ )
Contact : bernd.amann at lip6.fr
Language : English and/or French
Gross salary : to be determined according to the candidate's
qualifications
Period : September 1st 2013 - December 31, 2013 (4 months)
CONTEXT
The starting point of this project is an existing platform for the
phylogenetic analysis of evolving text corpora (blogs, forums,
bibliographies) developed by David Chavalarias (CAMS/CNRS) and
Jean-Philippe Cointet (INRA-SenS), who are also members of the Institut
des Systèmes Complexes de Paris Ile-de-France ( http://www.iscpif.fr/
). This platform includes several specialized tools for the extraction,
indexing, clustering and alignment of contents which can be combined
into complex workflows for the semantic and temporal analysis of
evolving text corpora.
OBJECTIVES
Current workflow implementations are optimized for medium-sized corpora
(~300k items, ~4000 concepts). Our goal is to develop new techniques and
algorithms that can process larger collections and terminologies and
that support more interactive user interfaces for content analysis:
* Large-scale phylogenetic analysis of evolving corpora: we propose
to explore two directions for extending the platform:
1. define and integrate new data structures and indexing methods
for accessing and joining text contents; in particular,
explore efficient approximation and compression techniques
for reducing processing cost and data size (similarity and
co-occurrence matrices);
2. study the parallelization and MapReduce implementation of
some key functionalities (extraction, indexing, clustering).
* Online analysis of dynamic information: The analysis workflow
depends on complex temporal and semantic processing steps that
may produce results that are erroneous or useless to the end
user. The goal is to adapt the current workflow so that it
follows the content evolution online by incrementally updating
the generated analytic data model (i.e., by incrementally
maintaining the temporal co-occurrence matrix). We propose to
study to what extent it is possible to group data according to
user interests and to decompose the workflow into independent
processes that can be executed in parallel (MapReduce).
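For illustration only, the two directions above can be sketched in a few
lines of Python. This is a minimal single-machine sketch and not the
platform's actual implementation: the function names, the (period,
concepts) document layout and the sparse dictionary representation of the
temporal co-occurrence matrix are all assumptions made for the example.
It shows a map step that emits concept pairs per time period, a reduce
step that sums them, and an incremental update that processes only newly
arrived documents.

```python
from collections import Counter
from itertools import combinations


def map_document(period, concepts):
    """Map step (hypothetical): emit ((period, a, b), 1) for every
    unordered pair of distinct concepts found in one document."""
    unique = sorted(set(concepts))  # de-duplicate and fix the pair order
    for a, b in combinations(unique, 2):
        yield (period, a, b), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def build_matrix(documents):
    """Build a sparse temporal co-occurrence matrix from an iterable of
    (period, concepts) documents."""
    emitted = (
        kv
        for period, concepts in documents
        for kv in map_document(period, concepts)
    )
    return reduce_counts(emitted)


def update_matrix(matrix, new_documents):
    """Incremental maintenance: count only the new documents and add
    their counts to the existing matrix (Counter.update adds counts)."""
    matrix.update(build_matrix(new_documents))
    return matrix
```

Because the per-key sums are associative and commutative, the same map
and reduce steps could in principle be distributed over document
partitions, and updating with new documents yields the same counts as a
full rebuild over the extended corpus.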
The development of these techniques and algorithms will build on the
experience of two database research teams including B. Amann,
C. Constantin and H. Naacke from LIP6-UPMC ( http://www-bd.lip6.fr/ )
and D. Gross-Amblard and Z. Miklos from IRISA-Rennes (
http://people.irisa.fr/David.Gross_Amblard/ and
http://people.irisa.fr/Zoltan.Miklos/ ).
WORK PLAN
The work to be accomplished is decomposed into several steps:
1. Problem analysis and state of the art
2. Detailed task specification
3. Implementation and experimentation on large corpora
(WebOfScience, GlobalPulse Twitter collection)
CANDIDATE PROFILE
Candidates should send their CV to bernd.amann at lip6.fr . We are
particularly interested in applicants holding at least a Master's degree
with a background in one or more of the following topics: large-scale
data processing, unsupervised / semi-supervised machine learning, text
mining and natural language processing (NLP). Given the tight deadline
of this offer, candidates from foreign countries should already have
a work permit covering the internship period.
BIBLIOGRAPHY
Chavalarias D, Cointet J-P (2013) Phylomemetic Patterns in Science
Evolution-The Rise and Fall of Scientific Fields. PLoS ONE 8(2):
e54847. doi:10.1371/journal.pone.0054847
Kyuseok Shim. MapReduce Algorithms for Big Data Analysis, Tutorial
(VLDB12, SWCW13)
Richard M. C. McCreadie, Craig Macdonald, and Iadh Ounis. 2009. On
single-pass indexing with MapReduce. In Proceedings of the 32nd
international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '09). ACM, New York, NY, USA,
742-743. DOI=10.1145/1571941.1572106
http://doi.acm.org/10.1145/1571941.1572106
Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed
Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale
End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International
Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington,
DC, USA, 512-521. DOI=10.1109/ICDM.2008.142
http://dx.doi.org/10.1109/ICDM.2008.142
--
Bernd Amann
Université Pierre et Marie Curie
LIP6 - Boite Courrier 169
4 place Jussieu
75 252 Paris cedex 05
Database Team (Equipe Base de Données)
Office 25-26/506
Email : Bernd.Amann at lip6.fr Tel : +33 (0)1 44 27 70 09
Fax : +33 (0)1 44 27 70 00
-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription : http://www.atala.org/article.php3?id_article=48
Archives : http://listserv.linguistlist.org/archives/ln.html
http://liste.cines.fr/info/ln
The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership : http://www.atala.org/
-------------------------------------------------------------------------