[Corpora-List] Query on the use of Google for corpus research
Marco Baroni
baroni at sslmit.unibo.it
Wed Jun 1 13:35:37 UTC 2005
> On May 31, 2005, at 6:56 PM, Marco Baroni wrote:
> > it is a good idea to develop/gather/share
> > tools and procedures to get them in "corpus format"...
>
> I have not followed this discussion very closely, so forgive me if I
> am asking the obvious--but I wonder what you mean by "corpus format"?
Sorry if I was vague. By "corpus format" I meant the result of transforming
raw data gathered from the web into something that can be used as a corpus.
Minimally, I guess that would mean making sure that all documents are in the
same character encoding, but of course a good deal of post-processing
(HTML/boilerplate stripping, (near-)duplicate detection, language
identification...), annotation (POS tagging, lemmatization,
meta-information...), and indexing with CWB or XAIRA or similar tools would
be highly desirable.
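
To make the first few steps concrete, here is a rough sketch in Python of
encoding normalization, HTML stripping and near-duplicate filtering. It is only
an illustration under simple assumptions, not a recommended pipeline: the
function names, the latin-1 fallback, the 3-word shingles and the 0.8 threshold
are all choices made for this example, and real boilerplate removal and
language identification would need dedicated tools.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def to_unicode(raw, declared="utf-8"):
    """Decode bytes with the declared encoding; fall back to latin-1.

    The fallback is an assumption for this sketch: latin-1 never fails,
    but it may produce wrong characters for other encodings.
    """
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return raw.decode("latin-1")


def strip_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def shingles(text, n=3):
    """Return the set of word n-grams ("shingles") of a text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text kept so far."""
    kept, kept_shingles = [], []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so for
anything web-sized one would replace it with a hashing scheme; it is used here
only because it is easy to read.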
Regards,
Marco