[Corpora-List] fast string replacement

Damon Allen Davison allolex at gmail.com
Mon Mar 14 11:47:35 UTC 2005


Dear Jörg,

In this case, you'll probably be happier having your lookup dictionary
live in a database because access is faster. You can still use a
scripting language like Perl to do the glue work for you, but it's
conceivable to do this entirely in SQL. We have a very large corpus
collection in our collocations dictionary project
(http://www.romanistik.uni-koeln.de/home/blumenthal/colloc-en.shtml)
which is stored in a MySQL database with one record per token. I have
written a multiword-unit (MWU) tagger in Perl and SQL that works like this:

Assume a corpus stored in a MySQL database with one token per record
(in numerical order, with fldTokenID as a running counter and fldToken
as the actual token), plus an MWU lookup table in which the order of
each unit's elements is clearly marked.
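
Concretely, I am assuming a layout along these lines. The column names
fldTokenID and fldToken are the ones we actually use; the table names
and the lookup-table columns are only placeholders for this sketch:

        -- corpus: one row per token, in running order
        CREATE TABLE tblCorpus (
            fldTokenID INT UNSIGNED NOT NULL PRIMARY KEY,  -- token position
            fldToken   VARCHAR(100) NOT NULL,              -- the token itself
            INDEX (fldToken)                               -- for the token lookups below
        );

        -- lookup table: one row per element of each multiword unit
        CREATE TABLE tblMWU (
            fldMwuID    INT UNSIGNED NOT NULL,             -- which unit
            fldPosition TINYINT UNSIGNED NOT NULL,         -- 1 = first element, 2 = second, ...
            fldElement  VARCHAR(100) NOT NULL,             -- token (or lemma) of that element
            PRIMARY KEY (fldMwuID, fldPosition)
        );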

1. Read in a record of my multiword unit lookup table.

2. Use the *final* element of the MWU to create a temporary table with
all occurrences of that element. I wrote the tagger for French, where
the initial element of an MWU is often a preposition or other such
highly frequent part-of-speech.

3. Use the second-to-last MWU element to create a new temporary table,
building the SQL in Perl like this:

        $query  = "CREATE TABLE mwu_$index ";
        $query .= 'SELECT @a:=(a.fldTokenID-1) AS fldTokenID ';
        $query .= "FROM mwe_$previous_index a ";
        $query .= "INNER JOIN $tablename b ";
        $query .= 'USING(fldTokenID) ';
        $query .= "WHERE b.fldToken = \"$element\""; # you can also
use lemmata--token was just more expedient for me

4. Repeat this until you run out of MWU elements. (A rough end-to-end
sketch of these steps follows below.)
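
Putting steps 1 to 4 together, a minimal Perl/DBI sketch might look
like the following. It is only a sketch, not the code we actually run:
the connection details, tblCorpus, and the tblMWU lookup table from the
layout above are placeholders; only fldTokenID, fldToken, and the
mwu_N temporary tables correspond to what I described:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('DBI:mysql:database=corpus;host=localhost',
                               'user', 'password', { RaiseError => 1 });

        my $corpus = 'tblCorpus';   # one record per token
        my $mwu_id = 1;             # tag one unit per run; loop over all IDs in practice

        # Step 1: read the elements of one MWU, in order, from the lookup table.
        my @elements = @{ $dbh->selectcol_arrayref(
            'SELECT fldElement FROM tblMWU WHERE fldMwuID = ? ORDER BY fldPosition',
            undef, $mwu_id) };
        die "MWU $mwu_id not found\n" unless @elements;

        # Step 2: seed with the *final* element, keeping the position one to its left.
        my $index = 0;
        $dbh->do("DROP TEMPORARY TABLE IF EXISTS mwu_$index");
        $dbh->do("CREATE TEMPORARY TABLE mwu_$index
                  SELECT fldTokenID - 1 AS fldTokenID
                  FROM $corpus WHERE fldToken = " . $dbh->quote($elements[-1]));

        # Steps 3 and 4: walk leftward through the remaining elements.
        for my $element (reverse @elements[0 .. $#elements - 1]) {
            my $previous_index = $index++;
            $dbh->do("DROP TEMPORARY TABLE IF EXISTS mwu_$index");
            $dbh->do("CREATE TEMPORARY TABLE mwu_$index
                      SELECT a.fldTokenID - 1 AS fldTokenID
                      FROM mwu_$previous_index a
                      INNER JOIN $corpus b USING(fldTokenID)
                      WHERE b.fldToken = " . $dbh->quote($element));
        }

        # mwu_$index now holds the position just before each full match,
        # so fldTokenID + 1 is where the MWU starts in the corpus.
        my $starts = $dbh->selectcol_arrayref(
            "SELECT fldTokenID + 1 FROM mwu_$index ORDER BY fldTokenID");
        printf "MWU %d: %d occurrence(s)\n", $mwu_id, scalar @$starts;

        $dbh->disconnect;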

You can make this algorithm more efficient by bundling the entire MWU
into a single statement, which saves you the trouble of building
temporary tables. I was pressed for time, so I stopped developing the
program as soon as it worked, but that bundling is where I would start
if you need it to run faster.
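
To make that concrete, one way to bundle it is a single SELECT with one
self-join per element, the offsets fixed in the join conditions. Again
only a sketch; $dbh, $corpus, and @elements are the placeholder
variables from the sketch above:

        # Build one SELECT with a self-join per MWU element instead of
        # a chain of temporary tables.
        my @from  = ("$corpus t0");
        my @where = ('t0.fldToken = ' . $dbh->quote($elements[0]));
        for my $i (1 .. $#elements) {
            push @from,  "INNER JOIN $corpus t$i ON t$i.fldTokenID = t0.fldTokenID + $i";
            push @where, "t$i.fldToken = " . $dbh->quote($elements[$i]);
        }
        my $query     = 'SELECT t0.fldTokenID AS fldStartID FROM ' . join(' ', @from)
                      . ' WHERE ' . join(' AND ', @where);
        my $start_ids = $dbh->selectcol_arrayref($query);   # start positions of the MWU

With an index on fldToken, the optimizer is then free to start the join
from the rarest element rather than always working backwards from the
last one.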

The output table with the locations of the MWUs in the corpus was very
useful to us, since we wanted to be able to use the corpus and our
statistics both with and without taking the MWUs into account.

A detailed description (in German) of our methods for extracting
collocations is available in the journal Zeitschrift für Romanische
Philologie [2005; 121(1)]: "Kombinatorische Wortprofile und
Profilkontraste. Berechnungsverfahren und Anwendungen" by my
colleagues Peter Blumenthal (project director), Sascha Diwersy, and
Jörg Mielebacher.

Warm Regards,

Damon


PS: Vim rules! ;)
-- 

Damon Allen Davison
http://allolex.net


