[Corpora-List] Creating a Corpus - Annotator Background?

Fri Aug 7 19:34:03 UTC 2009

Stuart --- 

Good luck with that. 

At a practical level, I'd be curious as to your approach.

As you have recognized, disambiguation relies entirely on context.

You need to know at least one or two of the following:

A) Something about the broader subject being discussed. 

Knowing whether the discussion involves movies about pigs, 1930's era sports
figures, Christian iconography, the joys of motherhood or the physical
attributes of women of a certain age provides the context to disambiguate
the word "babe" in a given passage. 

Ironically, I've used corpus architectures to determine the subject of a
text passage as opposed to the other way around. 

Which Babe Did You Mean?

B) Something about the communicator(s).

Who is it? What do we know about them? What is their level of knowledge
concerning the subject? 

Chomsky always used the example of the woman coming to the doctor complaining
of pain in her "leg." The doctor's knowledge about the woman (very modest,
medically unsophisticated, embarrassed to be talking to a male physician)
helped him disambiguate "leg" without having to ask a lot of questions that
might not get answered correctly, and quickly deal with her vaginal
infection.

If a member of the Moral Majority uses the term "relativity," the odds are
high that he/she is using a meaning quite different from that of a
theoretical physicist. 

C) Something about the audience.

We don't necessarily have to know the subject being discussed if we know
enough about the audience. In the "babe" example, it may be enough to know
that the audience (people for whom the communication is intended) are
reading the word in Penthouse, as opposed to Christianity Today, Sports
Illustrated, or Modern Motherhood.
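
As a rough illustration of the three kinds of context above, here is a
minimal Python sketch of a lookup that picks a sense for "babe" from whatever
cue is available. The sense labels, the cue values, and the names SENSES and
disambiguate are all hypothetical, made up for this example; a real system
would infer the cues from the text rather than receive them as a dictionary.

```python
from typing import Dict, Optional, Tuple

# Hypothetical sense inventory: (term, cue kind, cue value) -> sense label.
# Cue kinds mirror the list above: subject, communicator, audience.
SENSES: Dict[Tuple[str, str, str], str] = {
    ("babe", "subject", "movies about pigs"): "pig (film character)",
    ("babe", "subject", "1930s sports figures"): "sports figure",
    ("babe", "subject", "Christian iconography"): "infant Jesus",
    ("babe", "subject", "motherhood"): "infant",
    ("babe", "audience", "Penthouse"): "attractive woman",
    ("babe", "audience", "Christianity Today"): "infant Jesus",
    ("babe", "audience", "Sports Illustrated"): "sports figure",
    ("babe", "audience", "Modern Motherhood"): "infant",
}

# Order in which cue kinds are consulted (an arbitrary choice for the sketch).
CUE_ORDER = ("subject", "communicator", "audience")


def disambiguate(term: str, cues: Dict[str, str]) -> Optional[str]:
    """Return the sense matched by the first usable cue, or None."""
    for kind in CUE_ORDER:
        value = cues.get(kind)
        if value is None:
            continue  # this kind of context is not known
        sense = SENSES.get((term, kind, value))
        if sense is not None:
            return sense
    return None  # no context available, so the term stays ambiguous


if __name__ == "__main__":
    print(disambiguate("babe", {"audience": "Penthouse"}))
    print(disambiguate("babe", {"subject": "movies about pigs"}))
    print(disambiguate("babe", {}))
```

The point of the sketch is only that knowing the audience can stand in for
knowing the subject: either cue alone is enough to select a sense.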

Most of the successful approaches I have encountered focus on some narrow
content subset. Legal text, medical text, and general business news among
others, have well-established categorization architectures. I believe
Thomson published their general news, business information and technology
"knowledge architecture" (categorization structure) several years back. That
would be a great place to start, as it is the broadest of these designs. 

BTW - one thing you'll learn is that the most useful subcategories for
disambiguation are not necessarily subject based. We found that it was
frequently more productive to identify and categorize the intended purpose
of the content. It helped to know what it was that the reader intended to
learn or what the content generator intended to communicate (and to whom).
That approach made the task of developing standardized content "buckets" far
easier and resulted in a system that could disambiguate terms more
accurately than any of the other approaches we had used previously. 

Jack Bryar

Grafton, VT 05146

Office: 802-843-6033

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Stuart Moore
Sent: Friday, August 07, 2009 9:17 AM
To: Corpora at uib.no
Subject: [Corpora-List] Creating a Corpus - Annotator Background?

Hi all

I'm trying to create a word sense disambiguation corpus.

I'm trying to decide what information should be collected on each
annotator - e.g. first language, profession (especially whether
they're a "Professional Linguist", or "Computational Linguistics
Researcher"...), any relevant education ("Degree in Linguistics",
"Degree in English", "Degree in another language")...

Are there any standard questions to ask of this form? I'm hoping to
have 10-20 annotators from a variety of backgrounds, each annotating
a few hundred examples. I don't think that background will make a
difference on my particular task, but without the right information I
can't actually show that.

I've tried looking for relevant papers, but so far haven't found any
'standard' list of questions to ask. Does anyone have any suggestions?

Many thanks

Stuart Moore

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
