[Corpora-List] Query on the use of Google for corpus research

Wed Jun 1 13:35:37 UTC 2005

> On May 31, 2005, at 6:56 PM, Marco Baroni wrote:
> >  it is a good idea to develop/gather/share
> > tools and procedures to get them in "corpus format"...
>
> I have not followed this discussion very closely, so forgive me if I
> am asking the obvious--but I wonder what you mean by "corpus format"?

Sorry if I was vague. I meant transforming raw data gathered from the web
into something that can be used as a corpus. Minimally, I guess, that would
mean making sure that all documents are in the same character encoding, but
of course a good deal of post-processing (HTML/boilerplate stripping,
(near-)duplicate detection, language identification...), annotation (POS
tagging, lemmatization, meta-information...), indexing with CWB or XAIRA or
similar tools, etc., would be highly desirable.
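
To make the first few steps concrete, here is a rough sketch in Python
(standard library only) of encoding normalization, HTML stripping and a
naive near-duplicate filter. All function names, the 3-word shingle size
and the 0.8 similarity threshold are my own assumptions for illustration,
not an established procedure; real boilerplate removal, language
identification, annotation and indexing are not covered.

```python
import re
from html.parser import HTMLParser


def decode_to_unicode(raw, declared_encoding=None):
    """Decode bytes with the declared encoding, then UTF-8, then Latin-1."""
    for enc in (declared_encoding, "utf-8"):
        if enc:
            try:
                return raw.decode(enc)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong or unknown encoding: try the next candidate
    return raw.decode("latin-1")  # last resort: every byte maps to a character


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def shingles(text, n=3):
    """Set of overlapping word n-grams (hypothetical default: n=3)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, threshold=0.8, n=3):
    """Keep a text only if it is below the threshold against all kept texts."""
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text, n)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept
```

The duplicate filter compares every document against every document kept
so far, so it is only meant for small collections; for web-scale data one
would want some fingerprinting scheme instead.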

Regards,

Marco