[Corpora-List] Book: Building and Exploring Web Corpora
Cédrick Fairon
Cedrick.Fairon at uclouvain.be
Thu Oct 25 11:24:22 UTC 2007
Dear Colleagues,
I am pleased to announce the publication of:
"Building and Exploring Web Corpora"
Proceedings of the 3rd Web as Corpus Workshop, incorporating Cleaneval
Cédrick FAIRON, Hubert NAETS, Adam KILGARRIFF and Gilles-Maurice de
SCHRYVER (eds)
In Cahiers du CENTAL, Presses universitaires de Louvain,
Louvain-la-Neuve, 2007
It is available in PDF format and in a printed version.
Table of contents, information and ordering: see
http://www.i6doc.com/docs/cental4
Summary
WAC
More and more people are using Web data for linguistic and NLP
research. The Web as Corpus workshop (WAC) provides a venue for
exploring how we can use it effectively and the advancements to which
this could lead. This book is a collection of the talks presented at
the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the
description of Web corpus collection projects, the exploration of Web
data characteristics from a linguistics/NLP perspective, and on the
use of crawled Web data for NLP purposes.
CLEANEVAL
Any use of Web data requires that it be cleaned in order to get rid
of unwanted material including, for example, HTML markup, navigation
bars and advertisements. To date there has been no sharing of resources
or expertise in this particular domain and the cleaning has often
been done minimally. Cleaneval was an exercise aimed at promoting
collaboration and improving our understanding of the issues. Results
and perspectives are presented in this book.
Cédrick Fairon
cedrick.fairon at uclouvain.be
Directeur du CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgique
tel: +32 10 47 37 88
fax: +32 10 47 26 06
http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora