[Corpora-List] Query on the use of Google for corpus research
Marco Baroni
baroni at sslmit.unibo.it
Wed Jun 1 15:36:49 UTC 2005
Your tools sound really interesting, and in part similar to what we are
developing/adapting. Is anything (besides GATES, of course) publicly
available?
> (PDF, Word, etc.) and strips out the text, does its best to identify
> titles, tables, etc. and mark them as such
So, here is where you identify the parts of a page that are probably not
worth keeping, or that should at least be marked as something other than
natural connected text? (E.g., header and footer material that is repeated
on many pages from the same site?) Delimiting these seems to be one of the
most annoying problems we are encountering right now...
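
To make the repeated-material idea concrete, here is a rough sketch, not a description of anyone's actual tool. It assumes the pages of one site are already available as plain-text strings, and it treats any line that recurs on a large share of those pages as likely header/footer material. The function names and the min_fraction threshold are hypothetical, chosen only for illustration.

```python
from collections import Counter


def find_repeated_lines(pages, min_fraction=0.5):
    """Return the lines found on at least min_fraction of the pages.

    `pages` is a list of plain-text strings, assumed to come from one site.
    The threshold is an illustrative assumption, not a tuned value.
    """
    counts = Counter()
    for text in pages:
        # Use a set so each distinct non-empty line counts once per page.
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(lines)
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated_lines(text, repeated):
    """Drop every line of `text` whose stripped form is in `repeated`."""
    kept = [line for line in text.splitlines() if line.strip() not in repeated]
    return "\n".join(kept)


pages = [
    "Home | About\nFirst article text.\n(c) Example Site",
    "Home | About\nSecond article text.\n(c) Example Site",
    "Home | About\nThird text.",
]
repeated = find_repeated_lines(pages, min_fraction=0.6)
cleaned = strip_repeated_lines(pages[0], repeated)
```

Exact line matching is of course too crude for real pages (dates, counters and session IDs make otherwise identical footers differ), so this only illustrates the general approach of comparing pages within a site.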
Regards,
Marco