Book: Schäfer and Bildhauer, Web Corpus Construction

Thierry Hamon hamon at LIMSI.FR
Wed Apr 23 18:24:04 UTC 2014

Date: Mon, 21 Apr 2014 08:52:25 +1000
From: Graeme Hirst <gh at>
Message-Id: <BC8926CF-FA89-4206-B5D3-2BCE514A5279 at>


Web Corpus Construction

by Roland Schäfer and Felix Bildhauer 
(Freie Universität Berlin, Germany)

Synthesis Lectures on Human Language Technologies #22 (Morgan & Claypool
Publishers), 2013, 145 pages


The World Wide Web constitutes the largest existing source of texts
written in a great variety of languages. A feasible and sound way of
exploiting this data for linguistic research is to compile a static
corpus for a given language. This approach has several advantages:
(i) Working with such corpora obviates the problems encountered when
using Internet search engines in quantitative linguistic research (such
as non-transparent ranking algorithms). (ii) Creating a corpus from web
data is virtually free. (iii) The size of corpora compiled from the WWW
may exceed by several orders of magnitude the size of language
resources offered elsewhere. (iv) The data is
locally available to the user, and it can be linguistically
post-processed and queried with the tools preferred by her/him. This
book addresses the main practical tasks in the creation of web corpora
up to giga-token size. Among these tasks are the sampling process (i.e.,
web crawling) and the usual cleanups including boilerplate removal and
removal of duplicated content. Linguistic processing, and the problems
that the different kinds of noise in web corpora cause for it, are
also covered. Finally, the authors show how web corpora can
be evaluated and compared to other corpora (such as traditionally
compiled corpora).

For additional material please visit the companion website:

Table of Contents: Preface / Acknowledgments / Web Corpora / Data
Collection / Post-Processing / Linguistic Processing / Corpus 
Evaluation and Comparison / Bibliography / Authors' Biographies

This title is available online without charge to members of institutions
that have licensed the Synthesis Digital Library of Engineering and
Computer Science.  Members of licensing institutions have unlimited
access to download, save, and print the PDF without restriction; use of
the book as a course text is encouraged.  To find out whether your
institution is a subscriber, visit
<>, or just click on the
book's URL above from an institutional IP address and attempt to
download the PDF.  Others may purchase the book from this URL as a PDF
download for US$30 or in print for US$40.  Printed copies are also
available from Amazon and from booksellers worldwide at approximately
US$40 or local currency equivalent.

Message distributed via the Langage Naturel list <LN at>
The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)

ATALA accepts no responsibility for the content of messages
distributed on the LN list
