[Corpora-List] N-grams -> database

Mark Davies Mark_Davies at byu.edu
Sun Oct 13 12:54:47 UTC 2013


The corpora from http://corpus.byu.edu/ are all stored in relational
databases, as are the n-grams that are based on these corpora
(http://www.ngrams.info).

My approach has been to create a "lexicon" of all of the unique "entries"
in the corpus, and then create SQL JOINs between that lexicon and the
actual n-grams. This is explained at
http://www.ngrams.info/samples_coca3.asp. Basically:

lexID = wordForm + lemma + PoS_tag, e.g.

lexID	wordForm	lemma	PoS
-----	--------	-----	---
2363	realized	realize	vvd
2629	realize	realize	vvi
8574	realizing	realize	vvg
9010	realization	realization	nn1
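
A minimal sketch of what such a lexicon table might look like in SQL
(the table name, column names, and types here are illustrative only):

CREATE TABLE lexicon (
    lexID    INT PRIMARY KEY,
    wordForm VARCHAR(60),
    lemma    VARCHAR(60),
    pos      VARCHAR(10)
);

-- one row per unique [wordForm + lemma + PoS] combination
INSERT INTO lexicon VALUES (2363, 'realized',    'realize',     'vvd');
INSERT INTO lexicon VALUES (2629, 'realize',     'realize',     'vvi');
INSERT INTO lexicon VALUES (8574, 'realizing',   'realize',     'vvg');
INSERT INTO lexicon VALUES (9010, 'realization', 'realization', 'nn1');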

These lexIDs are what then appear in the n-gram tables. For example,
here are the top four entries for lexID [2363], where it is in the first
position of a 3-gram:

Freq w1 w2 w3
--- --- --- ---
895	2363	19	3
709	2363	19	11
612	2363	15	16
580	2363	19	23


These correspond to:


895 // realized	realize	vvd // that	that	cat // the	the	at
709 // realized	realize	vvd // that	that	cat // I	i	ppis1
612 // realized	realize	vvd // it	it	pph1 // was	be	vbdz
580 // realized	realize	vvd // that	that	cat // he	he	pphs1
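
As a sketch (reusing the illustrative names above), the 3-gram table
holds only integers, and the readable output comes from joining it to
the lexicon once per slot:

CREATE TABLE ngrams3 (
    freq INT,
    w1   INT,   -- lexID of the first word
    w2   INT,   -- lexID of the second word
    w3   INT    -- lexID of the third word
);

SELECT n.freq,
       L1.wordForm, L1.lemma, L1.pos,
       L2.wordForm, L2.lemma, L2.pos,
       L3.wordForm, L3.lemma, L3.pos
FROM   ngrams3 n
JOIN   lexicon L1 ON L1.lexID = n.w1
JOIN   lexicon L2 ON L2.lexID = n.w2
JOIN   lexicon L3 ON L3.lexID = n.w3
WHERE  n.w1 = 2363
ORDER BY n.freq DESC;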

Four important points:

1. The best approach by far is to store the actual n-grams as integer
values (e.g. [2363 19 3] above). If you store these as actual word forms
(+ lemma + PoS), the queries will be much slower, unless the n-gram sets
are small (e.g. 10 million rows or fewer).

2. Because the lexID values (in this case the *unique* [wordForm + lemma
+ PoS] combinations) are stored once in the lexicon, you can keep as much
additional information there as you want, with no decrease in
performance.

3. It is the SQL JOIN between the n-gram tables and the lexicon table
that does most of the work.

4. *Clustered* indexes on the n-gram tables make all the difference.
Since you only get one clustered index per table, put it on the column
(e.g. w1 or w2 or w3) where you think it will be most useful.
Alternatively, create multiple tables with the same [w1+w2+w3]
information, and then place the clustered index on [w1] in table1, on
[w2] in table2, etc. (a minimal sketch follows this list).
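
For point 4, a minimal sketch in SQL Server syntax, assuming three copies
of the ngrams3 data above named ngrams3_byW1, ngrams3_byW2, and
ngrams3_byW3 (the names are illustrative):

-- copy optimized for searches that constrain the first slot
CREATE CLUSTERED INDEX cx_w1 ON ngrams3_byW1 (w1);

-- identical data, clustered on the second slot
CREATE CLUSTERED INDEX cx_w2 ON ngrams3_byW2 (w2);

-- identical data, clustered on the third slot
CREATE CLUSTERED INDEX cx_w3 ON ngrams3_byW3 (w3);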

Using this approach, I can query the n-grams for a 450 million word
database (COCA: http://corpus.byu.edu/coca) -- using word form, lemma,
PoS, or any combination of these -- in just a second or two, even for
the most difficult queries (e.g. the + [nn1]), as sketched below.
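
A query like [the + [nn1]] could then look roughly like this (a sketch
only, reusing the illustrative tables above; in practice you would run
it against the copy clustered on w1):

-- nouns (nn1) following "the", summed over the third slot
SELECT   L2.wordForm, SUM(n.freq) AS total
FROM     ngrams3 n
JOIN     lexicon L1 ON L1.lexID = n.w1
JOIN     lexicon L2 ON L2.lexID = n.w2
WHERE    L1.wordForm = 'the'
  AND    L2.pos = 'nn1'
GROUP BY L2.wordForm
ORDER BY total DESC;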

Again, a bit more info at: http://www.ngrams.info/samples_coca3.asp

Best,

Mark D.


====================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
====================================






On 10/13/13 12:04 AM, "Mihail Kopotev" <mihail.kopotev at helsinki.fi> wrote:

>You wrote:
>> Hi Mikhail,
>>
>> On 11/10/13 22:16, Mihail Kopotev wrote:
>>> I am wondering if there is a standard way to convert these n-grams into a
>>> database?
>> not that I'm aware of.
>>
>>> Technically, there is, of course, no problem converting them, but my
>>> question is which indexes should be built and what should be stored
>>> as-is, without any optimization.
>>> And more specifically, does it make any sense to keep the whole
>>> tagsets, or is it better to store each tag separately?
>> As always with databases, it depends on your application, i.e. on the
>> kind of queries you are going to be asking. If you can tell us more
>> about that, people may be able to give you concrete suggestions.
>>
>> Best,
>> Kilian
>>
>Thank you Kilian!
>A few words of explanation
>
>Regarding the data:
>- we have a large corpus of N-grams, for N = 2, ..., 6.
>- there are millions of N-grams for each N.
>- the language is Russian, i.e., it has rich morphology, with a complex
>set of morpho-syntactic categories, each category having from 2 to 10
>values.
>- as mentioned, each N-gram/record in the corpus has the following
>information: word-form_1 lemma_1  {POS + morpho-syntactic tags }_1
>word-form_2 lemma_2 ...
>
>Regarding the database organization:
>- we would like to compute a very wide range of co-occurrence statistics.
>For example, we might query: for all bi-grams, given that word_1 has
>lemma=X, and word_2 has POS=Y, return the distribution of counts over
>the values of category C in word_2. The main question is: what is the
>optimal schema for the database to support such queries?
>
>What should one N-gram database record look like? For example, for
>word_1, we can allocate one field each for: the word-form, the lemma and
>the POS.  Then what about the morpho-tags?  Should we have one field for
>each category, to store the value for that category? Since the
>categories are different for different parts of speech, should each record
>allocate fields for ALL possible categories -- with most values being
>NULL? That seems wasteful; will the database explode?
>
>Or should the morpho-tags field be a pointer to a record in another
>table, which then stores ONLY the appropriate categories (for this POS)
>with their values?  If so, will the indirection slow down the querying,
>and by how much?
>
>Or we could imagine a binarized representation: category 1 (having n_1
>values) is coded as the first n_1 bits.  If bit k is on, it means that
>category 1 had value k -- of course, these n_1 bits are then mutually
>exclusive (typically); then code the bits for the values of category 2,
>in the next n_2 bits, and so on.  This might be an easy way to encode
>the data, but does SQL support efficient querying of such bit-arrays?
>
>We imagine that this problem must have been encountered previously by
>many people working with similar data.  We would be grateful if we could
>learn about your experience, what works and what doesn't.
>
>A pointer to papers/reports describing this would, of course, also be
>welcome. Please reply to me; I will post a summary of replies (if any)
>to the list.
>
>Best,
>Mikhail Kopotev
>
>-- 
>Mikhail Kopotev, PhD, Adj.Prof.
>University Lecturer
>Department of Modern Languages
>University of Helsinki
>http://www.helsinki.fi/~kopotev
>
>

