27.3671, Diss: Ad hoc and General-Purpose Corpus Construction from Web Sources

Fri Sep 16 19:51:25 UTC 2016

LINGUIST List: Vol-27-3671. Fri Sep 16 2016. ISSN: 1069 - 4875.

Subject: 27.3671, Diss: Ad hoc and General-Purpose Corpus Construction from Web Sources

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry,
                                   Robert Coté, Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Fri, 16 Sep 2016 15:51:11
From: Adrien Barbaresi [adrien.barbaresi at oeaw.ac.at]
Subject: Ad hoc and General-Purpose Corpus Construction from Web Sources

Institution: Ecole Normale Supérieure 
Program: PhD program, school of linguistics 
Dissertation Status: Completed 
Degree Date: 2015 

Author: Adrien Barbaresi

Dissertation Title: Ad hoc and General-Purpose Corpus Construction from Web
Sources 

Dissertation URL:  https://hal.archives-ouvertes.fr/tel-01167309/

Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics

Dissertation Director(s):
Benoît Habert

Dissertation Abstract:

This thesis introduces theoretical and practical reflections on corpus
linguistics, computational linguistics, and web corpus construction. More
specifically, two different types of corpora from the web, specialized (ad
hoc) and general-purpose, are presented and analyzed, including suitable
conditions for their creation.

In a historical perspective, several milestones of corpus design are
presented, from pre-digital corpora at the end of the 1950s to web corpora in
the 2000s and 2010s. Three main phases are distinguished in this evolution:
first, the age of copy typing and the establishment of the scientific
methodology and tradition regarding corpora; second, the age of digitized text
and further development of corpus linguistics; and third, the arrival of web
data and “opportunistic” approaches among researchers. The continuities with
and departures from the linguistic tradition are laid out.

In the second chapter, methodological insights on automated text scrutiny are
presented. Readability studies and automated text classification serve as
exemplary methods for finding salient features that capture text
characteristics. In conclusion, guiding principles for research practice are
listed, and reasons are given to find a balance between quantitative analysis
and corpus linguistics, in an environment shaped by technological innovation
and artificial intelligence techniques.

Third, current research on web corpora is summarized. The chapter opens with
notions of “web science”. Then, I examine the issue of data collection, more
specifically from the perspective of URL seeds, both for general and for
specialized corpora. I distinguish two main approaches to web document
retrieval: restricted retrieval, where documents to be retrieved are listed or
even known in advance, and web crawling. I show that the latter should not be
deemed too complex for linguists, by summarizing different strategies to find
new documents and discussing their advantages and limitations. Finally, ways
to target small fractions of the Web and the related issues are described. In
a further section, the notion of web corpus preprocessing is introduced and
salient steps are discussed.
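
As a rough illustration of the distinction between restricted retrieval and
web crawling (a minimal sketch, not code from the thesis): the toy link graph
TOY_WEB and the function names below are hypothetical, and a real crawler
would fetch pages over the network, respect robots.txt, and filter links.

```python
from collections import deque

# Hypothetical toy link graph standing in for the web: page -> outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def restricted_retrieval(url_list, web):
    """Retrieve only URLs that are listed in advance; no links are followed."""
    return [url for url in url_list if url in web]


def crawl(seeds, web, max_pages=10):
    """Breadth-first crawl: start from URL seeds and follow links to new pages."""
    seen = set(seeds)          # URLs already queued, to avoid revisiting
    frontier = deque(seeds)    # URLs waiting to be visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url not in web:     # unreachable or unknown page
            continue
        visited.append(url)
        for link in web[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(restricted_retrieval(["a.example/", "x.example/"], TOY_WEB))
    print(crawl(["a.example/"], TOY_WEB))
```

In this sketch the seed list determines which part of the graph a crawl can
reach, which is one reason the choice of URL seeds matters for both general
and specialized corpora.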

I present my work on web corpus construction in the fourth chapter, with two
types of end products: specialized and even niche corpora on the one hand, and
general-purpose corpora on the other. My analyses concern two main aspects:
first, the question of corpus sources (or prequalification), and second, the
problem of including valid, desirable documents in a corpus (or document
qualification). First, I show that it is possible and even desirable to use
sources other than search engines, which are the current state of the art,
and I introduce a light scout approach along with experiments to prove that a
preliminary analysis and selection of crawl sources is possible as well as
profitable. Second, I work on document selection, in order to enhance web
corpus quality in general-purpose approaches and to perform a suitable quality
assessment in the case of specialized corpora. I show that it is possible to
use salient features inspired by readability studies along with machine
learning approaches in order to improve corpus construction processes. To this
end, I select a number of features extracted from the texts and test them on
an annotated sample of web texts. Last, I present work on corpus
visualization, which consists of extracting certain corpus characteristics in
order to give indications of corpus contents and quality.
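
To make the idea of readability-inspired features for document qualification
more concrete, here is a minimal sketch. It is not the feature set or the
classifier used in the thesis: the four surface features, the function names,
and the threshold rule (a stand-in for a trained machine learning model) are
all assumptions chosen for illustration.

```python
import re


def surface_features(text):
    """Compute a few simple surface features of a text (illustrative choice)."""
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text)
    n_words = len(words)
    if n_words == 0:
        return {"n_words": 0, "avg_word_length": 0.0,
                "avg_sentence_length": 0.0, "type_token_ratio": 0.0}
    return {
        "n_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
    }


def looks_like_running_text(features, min_words=5, min_sentence_length=3.0):
    """Hypothetical threshold rule standing in for a trained classifier."""
    return (features["n_words"] >= min_words
            and features["avg_sentence_length"] >= min_sentence_length)


if __name__ == "__main__":
    for sample in ["The cat sat on the mat. The dog ran.", "Home | Login"]:
        feats = surface_features(sample)
        print(sample, feats, looks_like_running_text(feats))
```

In practice such features would be computed for an annotated sample of web
texts and passed to a classifier rather than compared against fixed
thresholds.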

------------------------------------------------------------------------------

----------------------------------------------------------
LINGUIST List: Vol-27-3671	
----------------------------------------------------------