Job: Internship Announcement (4 months), Large-Scale Socio-Semantic Corpus Indexing And Analysis

Sun Jun 30 17:36:38 UTC 2013

Date: Fri, 28 Jun 2013 12:01:18 +0200 (CEST)
From: Pascale Sebillot <pascale.sebillot at irisa.fr>
Message-ID: <717388927.3792745.1372413678721.JavaMail.root at irisa.fr>
X-url: http://www.cnrs.fr/mi/spip.php?article53
X-url: http://mastodons.lip6.fr/

Internship Announcement (4 months) 

LARGE-SCALE SOCIO-SEMANTIC CORPUS INDEXING AND ANALYSIS

CNRS MASTODONS / ARESOS 

ISC-PIF (Paris), LIP6 (Paris), IRISA (Rennes) 

In the context of the ARESOS project funded by the French national CNRS
action MASTODONS (Grandes masses de données scientifiques,
http://www.cnrs.fr/mi/spip.php?article53 ), we are offering a paid
internship in the area of large-scale content indexing and analysis. The
internship lasts 4 months, from September 1st to December 31, 2013.

Participating Labs : ISC-PIF (Paris), LIP6 (Paris), IRISA (Rennes) 
Location : The work will be carried out in Paris and/or Rennes (open for
discussion), with some short visits to the other laboratories.
Funding : CNRS Mastodons ( http://mastodons.lip6.fr/ ) 
Contact : bernd.amann at lip6.fr 
Language : English and/or French 
Gross salary : to be determined according to the candidate's
qualifications

Period : September 1st 2013 - December 31, 2013 (4 months) 

CONTEXT 

The starting point of this project is an existing platform for the
phylogenetic analysis of evolving text corpora (blogs, forums,
bibliographies) developed by David Chavalarias (CAMS/CNRS) and
Jean-Philippe Cointet (INRA-SenS), who are also members of the Institut
des Systèmes Complexes de Paris Ile-de-France ( http://www.iscpif.fr/
). This platform includes several specialized tools for the extraction,
indexing, clustering and alignment of contents which can be combined
into complex workflows for the semantic and temporal analysis of
evolving text corpora.

OBJECTIVES 

Current workflow implementations are optimized for medium-sized corpora
(~300k items, ~4000 concepts). Our goal is to develop new techniques and
algorithms that can process larger collections and terminologies and
that allow for more interactive user interfaces for content analysis:

    * Large-scale phylogenetic analysis of evolving corpora : we propose
      to explore two directions for extending the platform:

        1. define and integrate new data structures and indexing methods
           for accessing and joining text contents; in particular,
           explore efficient approximation and compression techniques
           for reducing processing cost and data size (similarity and
           co-occurrence matrices).

        2. study the parallelization and MapReduce implementation of
           some key functionalities (extraction, indexing, clustering) 

    * Online analysis of dynamic information: The analysis workflow
      depends on complex temporal and semantic processing steps that
      may produce results that are erroneous or useless to the end
      user. The goal is to adapt the current workflow so that it
      follows the content evolution online by incrementally updating
      the generated analytic data model (i.e. incrementally maintaining
      the temporal co-occurrence matrix). We propose to study to what
      extent it is possible to group data according to user interests
      and to decompose the workflow into independent processes that can
      be executed in parallel (MapReduce).
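
As an illustration of the last two points, here is a minimal Python
sketch of a temporal co-occurrence count written in map/reduce style,
together with an incremental update for newly arrived documents. It is
not the platform's implementation: the function names, the
(period, terms) document layout and the sample terms are all assumptions
made for this example, and it runs in a single process rather than on a
MapReduce cluster.

```python
from collections import Counter
from itertools import combinations


def map_document(period, terms):
    """Map step: emit ((period, term_a, term_b), 1) for each unordered term pair."""
    # Deduplicate and sort so that each pair has one canonical key.
    for a, b in combinations(sorted(set(terms)), 2):
        yield (period, a, b), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for identical keys."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def cooccurrence(documents):
    """Apply the map step to every (period, terms) document, then reduce."""
    return reduce_counts(
        pair
        for period, terms in documents
        for pair in map_document(period, terms)
    )


def update(matrix, new_documents):
    """Incremental maintenance: only the new batch is processed and added."""
    matrix.update(cooccurrence(new_documents))  # Counter.update adds counts
    return matrix


# Hypothetical sample data: (period, list of extracted terms).
first_batch = [
    (2012, ["gene", "network", "protein"]),
    (2012, ["gene", "network"]),
]
new_batch = [
    (2013, ["gene", "network"]),
    (2012, ["network", "protein"]),
]

matrix = cooccurrence(first_batch)
update(matrix, new_batch)
```

Because the reduce step is a plain sum, the counts of a new batch can be
added to the existing matrix without reprocessing earlier documents, and
the map step can run independently on each document.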

The development of these techniques and algorithms will build on the
experience of two database research teams including B. Amann,
C. Constantin and H. Naacke from LIP6-UPMC ( http://www-bd.lip6.fr/ )
and D. Gross-Amblard and Z. Miklos from IRISA-Rennes (
http://people.irisa.fr/David.Gross_Amblard/ and
http://people.irisa.fr/Zoltan.Miklos/ ).

WORK PLAN 

The work to be accomplished is decomposed into several steps: 

    1. Problem analysis and state of the art
    2. Detailed task specification 
    3. Implementation and experimentation on large corpora
       (WebOfScience, GlobalPulse Twitter collection) 

CANDIDATE PROFILE 

Candidates should send their CV to bernd.amann at lip6.fr . We are
particularly interested in applicants holding at least a Master's degree
with a background in one or more of the following topics: large-scale
data processing, unsupervised / semi-supervised machine learning, text
mining and natural language processing (NLP). Given the short timeline
of this proposal, candidates from foreign countries should already have
a work permit covering the internship period.

BIBLIOGRAPHY 

Chavalarias D, Cointet J-P (2013) Phylomemetic Patterns in Science
Evolution-The Rise and Fall of Scientific Fields. PLoS ONE 8(2):
e54847. doi:10.1371/journal.pone.0054847

Kyuseok Shim. MapReduce Algorithms for Big Data Analysis, Tutorial
(VLDB12, SWCW13)

Richard M. C. McCreadie, Craig Macdonald, and Iadh Ounis. 2009. On
single-pass indexing with MapReduce. In Proceedings of the 32nd
international ACM SIGIR conference on Research and development in
information retrieval (SIGIR '09). ACM, New York, NY, USA,
742-743. DOI=10.1145/1571941.1572106
http://doi.acm.org/10.1145/1571941.1572106

Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed
Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale
End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International
Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington,
DC, USA, 512-521. DOI=10.1109/ICDM.2008.142
http://dx.doi.org/10.1109/ICDM.2008.142

--

Bernd Amann
Université Pierre et Marie Curie 
LIP6 - Boite Courrier 169
4 place Jussieu
75 252 Paris cedex 05

Equipe Base de Données
Office 25-26/506
Email : Bernd.Amann at lip6.fr Tel : +33 (0)1 44 27 70 09
Fax : +33 (0)1 44 27 70 00

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription  : http://www.atala.org/article.php3?id_article=48
Archives                   : http://listserv.linguistlist.org/archives/ln.html
                             http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership : http://www.atala.org/
-------------------------------------------------------------------------