[Corpora-List] Book: Building and Exploring Web Corpora

Cédrick Fairon Cedrick.Fairon at uclouvain.be
Thu Oct 25 11:24:22 UTC 2007


Dear Colleagues,

I am pleased to announce the publication of:

"Building and Exploring Web Corpora"
Proceedings of the 3rd Web as Corpus Workshop, incorporating Cleaneval
Cédrick FAIRON, Hubert NAETS, Adam KILGARRIFF and Gilles-Maurice de
SCHRYVER (eds)
In Cahiers du CENTAL, Presses universitaires de Louvain,
Louvain-la-Neuve, 2007

It is available in PDF format and in print.
Table of contents, further information and ordering: see
http://www.i6doc.com/docs/cental4

Summary

WAC
More and more people are using Web data for linguistic and NLP
research. The Web as Corpus workshop (WAC) provides a venue for
exploring how we can use it effectively and the advancements to which
this could lead. This book is a collection of the talks presented at
the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the
description of Web corpus collection projects, the exploration of Web
data characteristics from a linguistics/NLP perspective, and on the
use of crawled Web data for NLP purposes.

CLEANEVAL
Any use of Web data requires that it be cleaned in order to get rid
of unwanted material including, for example, HTML markup, navigation
bars and advertisements. To date there has been no sharing of resources
or expertise in this particular domain and the cleaning has often
been done minimally. Cleaneval was an exercise aimed at promoting
collaboration and improving our understanding of the issues. Results
and perspectives are presented in this book.

Cédrick Fairon
cedrick.fairon at uclouvain.be

Director of CENTAL
Centre de traitement automatique du langage
Université catholique de Louvain
Place Blaise Pascal, 1
1348 Louvain-la-Neuve
Belgique
tel: +32 10 47 37 88
fax: +32 10 47 26 06

http://cental.fltr.ucl.ac.be
http://glossa.fltr.ucl.ac.be






