[Corpora-List] Free text corpora?

Alexandre Rafalovitch arafalov at gmail.com
Wed Mar 3 21:46:13 UTC 2010


On Wed, Mar 3, 2010 at 4:53 AM, Yannick Versley
<versley at sfs.uni-tuebingen.de> wrote:
...
> even carefully balanced corpora tend not to be
> representative of the finest genre distinctions - e.g., car repair manuals
> standing in as representative for all kinds of repair manuals, one magazine
> (possibly containing the idiolect of only a small group of people) standing
> in for magazines in general, or one kind of talk interactions (people doing
> small talk with a linguist nearby) standing in for all "oral" language.

I just wanted to add to this (without taking any sides). If you look
at the named entity recognition literature in computational linguistics
(MUC-6 and especially MUC-7), the papers talk about 'Open Domain' and
how their statistical algorithms can adjust to any type of corpus.

Reading the actual papers, that turns out to mostly mean news articles
with reasonably good English grammar and with named entities (like
people and company names) that are only a couple of tokens long, do
not nest, and certainly do not include punctuation or conjunctions.

Some 'open' domain that turns out to be! I am looking at named
entity mentions of 40 tokens with punctuation, conjunctions and
repeated internal grammatical structures. And that's just my domain. I
am not even talking about biomedical literature. "Open domain" in this
context starts to sound like "Modern art"...

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


