[Corpora-List] Call for applicants : PhD on incremental text mining

Jean-Charles Lamirel jean-charles.lamirel at loria.fr
Sat May 18 12:04:53 UTC 2013


Call for applicants 


PhD Topic: 

Setting up and validating new mining methods for the management 
of incremental textual data and hybrid dynamic data 
Supervisor: Dr. Habil. Jean-Charles Lamirel.
Research team: SYNALP-LORIA, University: Université de Lorraine

1. Context

The literature that takes the chronological aspect of information flows into account focuses on "DataStream" research, whose main concern is the on-the-fly management of data that are not initially stored and whose nature is likely to change. Research on "DataStream" was initiated in 1996 by DARPA through the TDT project ([5] [6] [28]). The data usually considered are essentially physical measurements or Web usage data (connection, browsing, etc.). Applications to text (bibliographic databases, online journals, data from interaction streams ...) or to hybrid data mixing texts and numerical experimental results in a temporal context, such as bioinformatics data, are still in their infancy. In addition, the existing algorithms are designed to handle very large volumes of data and are not optimal for tasks requiring more precise and accurate analysis, such as the detection of emerging topics in research or technological surveys, or the dynamic monitoring of the interaction with an end user. In a static context, such tasks also concern the analysis of heterogeneous data covering multiple topics or the study of data produced by complex processes, such as NLP data.

2. PhD goal

The purpose of the PhD will be to explore several different approaches for accurately mining and analyzing textual data with multiple components of static or incremental nature. The dynamic or incremental framework will be preferred. 


The first approach to be studied is the decomposition of the analysis into time steps. The principle is to carry out static classifications of the data groups associated with different time windows, or time steps. The static classifications obtained for the different groups are then compared to isolate the changes appearing at each time period. This approach could draw inspiration from the original principles proposed by the multiview data analysis paradigm (MVDA) developed by the SYNALP team ([2] [3] [13] [15] [16] [17] [20]), such as intelligent labeling, unsupervised Bayesian reasoning, online generalization, or unbiased symbolico-numerical quality measures for classification. It will require comparing the behavior of the usual static classification methods on textual data, analyzing the influence of the distances used in these methods, and proposing new alternatives. The implemented approach will also have to be compared or merged with alternative approaches, such as those based on latent Dirichlet allocation processes [23], Independent Component Analysis (ICA) or KL-divergence [1], or those based on novelty detection neural filters [12].
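
To make the time-step principle concrete, here is a minimal Python sketch of the comparison step only. It assumes that some static classification method has already been run on each time window and that every resulting cluster is summarized by a small set of label terms. The names `jaccard` and `compare_periods`, the Jaccard overlap and the 0.3 threshold are illustrative assumptions, not the MVDA method or any method of the SYNALP team.

```python
from typing import List, Set, Tuple

Cluster = Set[str]  # assumption: a cluster is summarized by its label terms


def jaccard(a: Cluster, b: Cluster) -> float:
    """Overlap between two label sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare_periods(
    earlier: List[Cluster], later: List[Cluster], threshold: float = 0.3
) -> Tuple[List[Tuple[int, int, float]], List[int], List[int]]:
    """Match the clusters of two consecutive time windows.

    Returns (matches, emerging, vanished):
      matches  -- (earlier_index, later_index, similarity) triples
      emerging -- indices of later clusters with no sufficiently similar ancestor
      vanished -- indices of earlier clusters that no later cluster matched
    """
    matches: List[Tuple[int, int, float]] = []
    emerging: List[int] = []
    matched_earlier = set()
    for j, new in enumerate(later):
        # Find the most similar cluster of the previous window.
        best_i, best_sim = -1, 0.0
        for i, old in enumerate(earlier):
            sim = jaccard(old, new)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i >= 0 and best_sim >= threshold:
            matches.append((best_i, j, best_sim))
            matched_earlier.add(best_i)
        else:
            emerging.append(j)
    vanished = [i for i in range(len(earlier)) if i not in matched_earlier]
    return matches, emerging, vanished


# Toy label sets for two consecutive time windows (invented for illustration).
period_1 = [{"laser", "diode", "emission"}, {"polymer", "film", "coating"}]
period_2 = [{"laser", "diode", "quantum"}, {"graphene", "sheet", "transistor"}]
matches, emerging, vanished = compare_periods(period_1, period_2)
print(matches, emerging, vanished)
```

In this toy run, the first cluster of each window is matched, the second cluster of the later window is flagged as emerging, and the second cluster of the earlier window as vanished. A real study would replace the label-set overlap with the quality measures and distances that the topic proposes to investigate.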


A second approach will be to develop unsupervised fine-grained classification methods that work on dynamic data. This flow-oriented approach implies that the proposed methods should be able to react and adapt their results to the appearance of each new data item. The implementation of such approaches may be inspired by the classification methods showing the best potential for incremental classification, such as the neural methods ([10] [11] [24] [25] [26] [27]) or the density-based methods ([4] [6] [7]). Among other things, it will consist in defining new rules for local learning that replace the rules of global learning usually employed in static versions of these methods. In this context, it will be possible to draw inspiration from the promising incremental techniques currently being experimented with in the SYNALP team, such as those proposed in the IGNGF method [18] [19] [21], or in its recent extensions based on contrast functions derived from the highly efficient feature maximization metric [22].

3. Experimental data

The methods will be developed within a collaborative research project involving several regional research teams, namely the Thomson Reuters Innovation platform (TRI). TRI is a multidisciplinary platform giving coordinated access to a very wide range of scientific publications, to a world reference collection of patents, and to the world reference Web of Science (WoS) citation network. It includes powerful preprocessing tools that make it easy to build time-stamped test datasets, as well as additional tools for validating the results.


Once the methodology has been stabilized, the main focus of the upcoming experiments will be bioinformatics. This will involve managing, in a coordinated way, textual data and numerical data from biological experiments on evolutionary processes. Within this framework, we have partnered with the Taiwanese IIR (Intelligent Information Retrieval) laboratory, attached to the National Science Council of Taiwan, and with the American NIEHS (National Institute of Environmental Health Sciences) laboratory to set up a platform for intelligent gene annotation. The platform will specifically involve the MVDA model that we developed for the management of multiple sources, the syntactic-semantic parsers developed by the IIR laboratory for the management of the textual data involved in the analysis, and numerical data from DNA microarrays that will be normalized using NIEHS laboratory protocols.


Full-scale experiments, like those already started by the SYNALP team [9], should also be carried out in parallel with the proposed methods, both for the processing of static linguistic data and for that of dynamic interaction data.


4. Programming languages : Matlab, Java, C, C++. 
5. Funding: three years of funding are offered if the applicant is selected by the LORIA laboratory PhD board.
6. Contact: Jean-Charles Lamirel - email : lamirel at loria.fr – gsm : +33824365491 
7. Submission deadline : May 28th, 2013. 


8. Documents to be provided for application (electronic version) 


· A full motivation letter explaining why you chose the topic and which of your skills will help you succeed in it.
· Recommendation letters from your former teachers (mandatory) and from your company managers (optional). Three letters would be nice.
· Copies of diplomas and grade transcripts, with a special focus on your Bachelor and Master degrees.
· Master report if it is in English.
· Copies of published papers if they are in English.
· Detailed resume including full contact address, birthdate and full education history, as well as your research experience, all internships, jobs, and programming skills.
· A special focus on your experience in the machine learning, statistics, and NLP domains would be useful.

9. References 


[1] Aksoy C. (2010). Novelty Detection in Topic Tracking, Master Thesis (Advisor Kan F.), Bilkent University, Turkey, July 2010. 
[2] Al-Shehabi S., Lamirel J.-C. (2004). Inference Bayesian Network for Multi-topographic neural network communication: a case study in documentary data. Proceedings of ICTTA, Damascus, Syria.
[3] Al-Shehabi S., Lamirel J.-C. (2005). Multi-Topographic Neural Network Communication and Generalization for Multi-Viewpoint Analysis. International Joint Conference on Neural Networks – IJCNN'05, Montreal, Canada. 
[4] Al Shehabi S., Lamirel J.-C. (2006). A new hyperbolic visualization method for displaying the results of a neural gas model: application to webometrics. Proceedings of the 14th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, April.
[5] Allan J., Carbonell J., Doddington G., Yamron J., Yang Y. (1998). Topic detection and tracking pilot study, final report. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia.
[6] Batagelj, V. and Zaversnik, M. (2002). An O(m) algorithm for cores decomposition of networks, University of Ljubljana, Preprint: IMFM 797. 
[7] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96): 226-231, AAAI Press, Menlo Park, CA. 
[8] Gaber M., Zaslavsky A. and Krishnaswamy S. (2005). Mining Data Streams: A Review. SIGMOD Record, 34(2). 
[9] Falk I., Lamirel J.-C., Gardent C. (2012). Classifying French Verbs Using French and English Lexical Resources, International Conference on Computational Linguistics (ACL 2012), Jeju Island, Korea, July 2012.
[10] Fritzke B. (1995). A growing neural gas network learns topologies, Tesauro G., Touretzky D. S., Leen T. K., Eds., Advances in Neural Information Processing Systems 7, pp 625-632, MIT Press, Cambridge MA.
[11] Hamza H., Belaïd Y., Belaïd A., Chaudhuri B. B. (2008). Incremental classification of invoice documents, 19th International Conference on Pattern Recognition - ICPR 2008.
[12] Kassab R., Lamirel J.-C. (2007). Towards a synthetic analysis of user's information need for more effective personalized filtering services, Proceedings of the 22nd Annual ACM Symposium on Applied Computing (SAC-IAR 2007), Seoul, Korea, March 2007.
[13] Kassab R., Lamirel J.-C., (2007). Feature Based Cluster Validation for High Dimensional Data, IASTED International Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria, February 2008. 
[14] Kohonen T. (1982). Self-organized formation of topologically correct feature maps, Biological Cybernetics, vol. 43, pp 56-59. 
[15] Lamirel J.-C., Al-Shehabi S., François C., Hoffmann M. (2004). New classification quality estimators for analysis of documentary information: application to patent analysis and web mapping. Scientometrics, 60(3). 
[16] Lamirel J.-C., Ta A.P., Attik M. (2007). Novel Labeling Strategies for Hierarchical Representation of Multidimensional Data Analysis Results, IASTED International Conference on Artificial Intelligence and Applications (AIA), Innsbruck, Austria, February 2008. 
[17] Lamirel J.-C. (2010). Vers une approche systémique et multivues pour l’analyse de données et la recherche d’information : un nouveau paradigme, HDR Report, University of Nancy 2, December 2010. 
[18] Lamirel J.-C., Boulila Z., Ghribi M., Cuxac P. (2010). A new incremental growing neural gas algorithm based on clusters labeling maximization: application to clustering of heterogeneous textual data, Proceedings of the 23rd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA-AIE 2010), Cordoba, Spain, June 2010.
[19] Lamirel J.-C., Mall R., Cuxac P., Safi G. (2011). Variations to incremental growing neural gas algorithm based on label maximization, Proceedings of IJCNN 2011, San Jose, CA, USA, August 2011.
[20] Lamirel J.-C. (2012). A new diachronic methodology for automatizing the analysis of research topics dynamics : an example of application on optoelectronics research, Scientometrics Special issue on 7th International Conference on Webometrics, Informetrics and Scientometrics and 12th COLLNET, Scientometrics 93(1): 151-166 (2012). 
[21] Lamirel J.-C., Reymond D. (2013). Automatic websites classification and retrieval using websites communication signatures, Proceedings of 8th International Conference on Webometrics, Informetrics and Scientometrics (WIS), Seoul, Korea, October 2012. To be published in: Journal of Information Management and Scientometrics (JIMS), Special issue on 8th International Conference on Webometrics, Informetrics and Scientometrics and 13th COLLNET.
[22] Lamirel J.-C., Cuxac P., Chivukula A.S., Hajlaoui K. (2013). A new feature selection and feature contrasting approach based on quality metric: application to efficient classification of complex textual data, QIMIE 2013: 3rd International PAKDD Workshop on Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models, Brisbane, Australia, April 2013.
[23] Li W., Huang Y. (2011). New Event Detect Based on LDA and Correlation of Subject Terms, International Conference on Internet Technology and Applications (iTAP), Wuhan, China, August 2011.
[24] Martinetz T. and Schulten K. (1991). A "neural gas" network learns topologies. In Kohonen T., Makisara K., Simula O., and Kangas J., editors, Artificial Neural Networks, pp 397-402. Elsevier, Amsterdam.
[25] Merkl D., Shao Hui He, Dittenbach M., and Rauber A. (2003). Adaptive hierarchical incremental grid growing: an architecture for high-dimensional data visualization. In Proceedings of the 4th Workshop on Self-Organizing Maps, Advances in Self-Organizing Maps, pp 293-298, Kitakyushu, Japan, September 11-14 2003.
[26] Prudent Y., Ennaji A. (2004). Extraction Incrémentale de la topologie des Données, 11èmes Rencontres de la Société Francophone de Classification, pp 278-281.
[27] Prudent Y., Ennaji A. (2005). An Incremental Growing Neural Gas learns Topology, ESANN2005, 13th European Symposium on Artificial Neural Networks, Bruges, Belgium, 27-29 April 2005, published in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), vol. 2, pp 1211-1216, 31 July-4 Aug. 2005.
[28] Wayne C.L. (1998). Topic detection & tracking (TDT): Overview & perspective. Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia.

Dr habil. Jean-Charles LAMIREL 
Maître de Conférences, Habilité à Diriger des Recherches 
Université de Strasbourg 
Projet INRIA TALARIS - LORIA - Nancy 
GSM : 0624365491 

