[Corpora-List] SQL -- thanks, and preliminary results on tagging a 20m word corpus (fwd)

Listserv Administrator listman at listserv.linguistlist.org
Sat Aug 4 18:06:24 UTC 2007


---------- Forwarded message ----------
Date: Thu, 12 Jul 2007 17:42:51 -0600
From: Mark Davies <Mark_Davies at byu.edu>
To: corpora at uib.no
Subject: [Corpora-List] SQL -- thanks,
     and preliminary results on tagging a 20m word corpus

Thanks to all of you who sent suggestions on the SQL statements. Piecing
together bits from all of the suggestions, I was able to create and
update the necessary tables.

I used the data from a 20 million word tagged corpus of Spanish to
create 1, 2, and 3-gram tables (words/POS/lemma for each "slot"), and
then ran queries to match these up with 2-grams, 3-grams, etc. in the
untagged corpus. I first used the 3-grams table (tagging the middle
word, with one word of context on each side), then, for the words still
untagged, the 2-grams table (one word of context to the left, and then
to the right), etc. I was able to tag the 20m word corpus in about 20
minutes total, once the n-gram tables were set up.
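
As a rough illustration of the cascade described above, here is a
minimal sketch using Python's built-in sqlite3 module. It is not the
SQL actually used for the Spanish corpus: the table names (tri,
bi_left, bi_right, uni, corpus), the column names, the "#" sentence
boundary marker, and the tiny hand-tagged sample are all assumptions
made for the example. Each pass only touches rows that are still
untagged, and ties are broken by frequency.

```python
import sqlite3
from collections import Counter

# Tiny hand-tagged sample standing in for the tagged corpus (illustrative only).
TAGGED = [
    [("yo", "PRON", "yo"), ("como", "VERB", "comer"), ("pan", "NOUN", "pan")],
    [("tan", "ADV", "tan"), ("alto", "ADJ", "alto"),
     ("como", "CONJ", "como"), ("ella", "PRON", "ella")],
]

# Back-off order: 3-gram, 2-gram (left context), 2-gram (right context), 1-gram.
# Each entry is (hypothetical table name, extra context condition).
PASSES = [
    ("tri", "g.w_prev = corpus.w_prev AND g.w_next = corpus.w_next"),
    ("bi_left", "g.w_prev = corpus.w_prev"),
    ("bi_right", "g.w_next = corpus.w_next"),
    ("uni", "1 = 1"),
]


def build_ngram_tables(conn, sentences):
    """Count n-grams around each tagged word and store them as tables."""
    tri, left, right, uni = Counter(), Counter(), Counter(), Counter()
    for sent in sentences:
        padded = ["#"] + [w for w, _, _ in sent] + ["#"]  # "#" marks a boundary
        for i, (w, pos, lemma) in enumerate(sent, start=1):
            w_prev, w_next = padded[i - 1], padded[i + 1]
            tri[(w_prev, w, w_next, pos, lemma)] += 1
            left[(w_prev, w, pos, lemma)] += 1
            right[(w, w_next, pos, lemma)] += 1
            uni[(w, pos, lemma)] += 1
    conn.executescript("""
        CREATE TABLE tri (w_prev TEXT, w TEXT, w_next TEXT,
                          pos TEXT, lemma TEXT, freq INTEGER);
        CREATE TABLE bi_left (w_prev TEXT, w TEXT,
                              pos TEXT, lemma TEXT, freq INTEGER);
        CREATE TABLE bi_right (w TEXT, w_next TEXT,
                               pos TEXT, lemma TEXT, freq INTEGER);
        CREATE TABLE uni (w TEXT, pos TEXT, lemma TEXT, freq INTEGER);
    """)
    tables = (("tri", tri), ("bi_left", left), ("bi_right", right), ("uni", uni))
    for table, counts in tables:
        rows = [key + (n,) for key, n in counts.items()]
        marks = ",".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def load_untagged(conn, sentences):
    """Store each untagged token together with its left and right neighbour."""
    conn.execute("""CREATE TABLE corpus (id INTEGER PRIMARY KEY, w_prev TEXT,
                    w TEXT, w_next TEXT, pos TEXT, lemma TEXT)""")
    for words in sentences:
        padded = ["#"] + list(words) + ["#"]
        conn.executemany(
            "INSERT INTO corpus (w_prev, w, w_next) VALUES (?, ?, ?)",
            [(padded[i - 1], padded[i], padded[i + 1])
             for i in range(1, len(padded) - 1)],
        )


def tag_corpus(conn):
    """Run the passes in order; each one fills only rows that are still NULL."""
    for table, context in PASSES:
        best = (f"FROM {table} AS g WHERE g.w = corpus.w AND {context} "
                "ORDER BY g.freq DESC, g.pos, g.lemma LIMIT 1")
        conn.execute(
            f"UPDATE corpus SET pos = (SELECT g.pos {best}), "
            f"lemma = (SELECT g.lemma {best}) WHERE pos IS NULL"
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_ngram_tables(conn, TAGGED)
    load_untagged(conn, [["hoy", "como", "pan"], ["alto", "como", "yo"]])
    tag_corpus(conn)
    for row in conn.execute("SELECT w, pos, lemma FROM corpus ORDER BY id"):
        print(row)
```

In this toy data the ambiguous word "como" is resolved by context
rather than by overall frequency: it is tagged as a verb before "pan"
(via the right-context 2-gram) and as a conjunction after "alto" (via
the left-context 2-gram). A word that never occurs in the tagged
sample, such as "hoy" here, simply stays untagged for later processing.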

It seemed to work quite well -- at least at first glance -- looking at
15-20 particularly problematic words in Spanish. While this approach
certainly wouldn't be the last step in tagging a text, it may be a way
to get things in shape for more sophisticated (and probably
time-intensive) processes.

I also have the data for the 100m word British National Corpus in an
n-gram relational database (word+POS; same info used for
http://corpus.byu.edu/bnc), and may try this for English as well, by
applying the BNC data to an untagged corpus of English.

In summary, while there are certainly many approaches to tagging a
corpus, this relational database approach appears to have some merit as
well.

Thanks again for the input.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


_______________________________________________
corpora mailing list
corpora at uib.no
http://mailman.uib.no/listinfo/corpora


