[Corpora-List] Query on the use of Google for corpus research

Wed Jun 1 15:06:21 UTC 2005

On Jun 1, 2005, at 9:35 AM, Marco Baroni wrote:

> Sorry if I was vague. I meant something like: to transform raw data
> gathered from the web into something that can be used as a corpus.
> Minimally, that would mean making sure that all documents are in the
> same character encoding, I guess, but of course a good deal of
> post-processing (html/boilerplate stripping, (near-)duplicate
> detection, language identification...), annotation (POS,
> lemmatization, meta-information...), indexing with CWB or XAIRA or
> similar tools, etc., would be highly desirable.
>
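
To make the minimal step in the quoted message concrete (getting all
documents into the same character encoding), here is a rough Python
sketch. It is only an illustration under stated assumptions: the
function name to_utf8 and the candidate encoding list are made up for
this example, and trying encodings in order is a crude stand-in for
real encoding detection.

```python
def to_utf8(raw: bytes, candidates=("utf-8", "cp1252")) -> bytes:
    """Re-encode raw bytes as UTF-8.

    Hypothetical helper: tries each candidate encoding in order and
    uses the first one that decodes without error.
    """
    for enc in candidates:
        try:
            return raw.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort
    # never fails, although the result may not be the intended text.
    return raw.decode("latin-1").encode("utf-8")
```

Pages that declare their encoding in an HTTP header or a meta tag
should probably be decoded with the declared encoding first; the
fallback order above is just a guess for Western European text.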

We've actually done a lot of that in the process of developing the
American National Corpus. We have gotten data off the web in several
formats, but for our purposes the data has to be American English,
produced post-1990, and not under any copyright constraints, so we
are a bit more picky about what we download than the "web as corpus"
approach dictates. We have a pipeline that takes data in most formats
(PDF, Word, etc.) and strips out the text, does its best to identify
titles, tables, etc. and mark them as such, and runs it through GATE
(http://gate.ac.uk), with some additional GATE plugins we've
developed, to do tokenization, sentence splitting, POS tagging, noun
and verb phrase chunking, etc. We dump it out in our XML stand-off
format, in UTF-16 for raw data and UTF-8 for annotations, but since
the pipeline is modular, any step can be replaced with another tool to
do something differently. We also deal with HTML, but because users
can use HTML tags any way they like (e.g. <p> and <font> tags for
headers instead of <h1> etc.), and no two documents are ever the same
(it seems), this is more labor-intensive. We also have a tool for
near-duplicate detection, which was used on NYTimes data but might be
generalizable.
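
For readers who have not seen near-duplicate detection before, one
common approach is to compare documents by their overlapping word
n-grams ("shingles") and flag pairs whose Jaccard similarity is high.
The Python sketch below illustrates that general technique only; it is
not the ANC tool, and the function names, the shingle size k, and the
0.8 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, k=5, threshold=0.8):
    """Return sorted name pairs whose shingle sets overlap >= threshold.

    docs maps a document name to its text. All pairs are compared,
    which is quadratic and only sensible for small collections.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if jaccard(sets[x], sets[y]) >= threshold:
                pairs.append((x, y))
    return pairs
```

For a large collection the all-pairs comparison would need to be
replaced by something cheaper, such as hashing the shingles and
comparing compact sketches instead of full sets.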

BTW the ANC can be used with XAIRA--see
http://AmericanNationalCorpus.org/xaira.html, which provides a few
pre-processing tools that enable indexing the ANC data in XAIRA.

I am not sure if any of what we've done is useful to others, but we
are happy to share anything we have.