[Corpora-List] N-grams -> database

Mihail Kopotev mihail.kopotev at helsinki.fi
Sun Oct 13 06:04:34 UTC 2013


You wrote:
> Hi Mikhail,
>
> On 11/10/13 22:16, Mihail Kopotev wrote:
>> I am wondering if there is a standard way to convert these n-grams into a
>> database?
> not that I'm aware of.
>
>> Technically, there is, of course, no problem converting them, but my
>> question is which indexes should be built and what should be stored as is
>> without any optimization.
>> And more specifically, does it make any sense to keep the whole tagsets,
>> or is a better way to store each tag separately?
> As always with databases, it depends on your application, i.e. on the
> kind of queries you are going to be asking. If you can tell us more
> about that, people may be able to give you concrete suggestions.
>
> Best,
> Kilian
>
Thank you, Kilian!
A few words of explanation:

Regarding the data:
- we have a large corpus of N-grams, for N = 2, ..., 6.
- there are millions of N-grams for each N.
- the language is Russian, i.e., it has rich morphology, with a complex 
set of morpho-syntactic categories, each category having from 2 to 10 
values.
- as mentioned, each N-gram/record in the corpus has the following 
information: word-form_1 lemma_1 {POS + morpho-syntactic tags}_1 
word-form_2 lemma_2 ...

Regarding the database organization:
- we would like to compute a very wide range of co-occurrence statistics. 
For example, we might query: for all bi-grams, given that word_1 has 
lemma=X and word_2 has POS=Y, return the distribution of counts over 
the values of category C in word_2. The main question is: what is the 
optimal schema for the database to support such queries?

What should one N-gram database record look like? For example, for 
word_1, we can allocate one field each for: the word-form, the lemma and 
the POS.  Then what about the morpho-tags?  Should we have one field for 
each category, to store the value for that category? Since the 
categories are different for different parts of speech, does each record 
allocate fields for ALL possible categories -- with most values being 
NULL?  That seems wasteful; will the database explode in size?
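
The "one field per category" option could be sketched as follows, again in 
Python with SQLite. The category list is a hypothetical subset, and the 
helper wide_schema() and all column names are made up for illustration. The 
sketch only shows how many fields stay NULL for a single record; it says 
nothing about the storage cost in any particular database engine.

```python
import sqlite3

# Hypothetical subset of morpho-syntactic categories.
CATEGORIES = ["case", "number", "gender", "tense", "person"]

def wide_schema(n, categories):
    """Build a CREATE TABLE statement with one column per category per word."""
    cols = []
    for i in range(1, n + 1):
        cols += [f"form_{i} TEXT", f"lemma_{i} TEXT", f"pos_{i} TEXT"]
        cols += [f"{cat}_{i} TEXT" for cat in categories]
    cols.append("freq INTEGER")  # assumed frequency column
    return f"CREATE TABLE ngram_{n} ({', '.join(cols)})"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(wide_schema(2, CATEGORIES))

# Invented record: only the categories relevant to each POS are filled in.
record = {
    "form_1": "читал", "lemma_1": "читать", "pos_1": "VERB",
    "tense_1": "past", "number_1": "sg", "gender_1": "masc",
    "form_2": "книгу", "lemma_2": "книга", "pos_2": "NOUN",
    "case_2": "acc", "number_2": "sg", "gender_2": "fem",
    "freq": 120,
}
names = ", ".join(record)
marks = ", ".join("?" for _ in record)
conn.execute(f"INSERT INTO ngram_2 ({names}) VALUES ({marks})",
             list(record.values()))

row = conn.execute("SELECT * FROM ngram_2").fetchone()
null_columns = [key for key in row.keys() if row[key] is None]
print(len(row.keys()), "columns, NULL in:", null_columns)
```

With the full tagset and N = 6 the number of columns, and the share of NULLs 
per record, would be correspondingly larger.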

Or should the morpho-tags field be a pointer to a record in another 
table, which then stores ONLY the appropriate categories (for this POS) 
with their values?  If so, will the indirection slow down the querying, 
and by how much?
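
One possible reading of the "pointer to another table" option, sketched in 
Python with SQLite: each distinct tag combination is stored once, and the 
N-gram record holds only its id. All table and column names, the index and 
the sample data are assumptions made for illustration; the sketch shows what 
the example query would look like with the extra joins, not how fast it runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per distinct tag combination.
    CREATE TABLE tagset (tag_id INTEGER PRIMARY KEY, pos TEXT NOT NULL);
    -- Only the categories that apply to that combination are stored.
    CREATE TABLE tag_value (
        tag_id   INTEGER NOT NULL REFERENCES tagset(tag_id),
        category TEXT NOT NULL,
        value    TEXT NOT NULL,
        PRIMARY KEY (tag_id, category)
    );
    CREATE TABLE bigram (
        form_1 TEXT, lemma_1 TEXT, tag_1 INTEGER REFERENCES tagset(tag_id),
        form_2 TEXT, lemma_2 TEXT, tag_2 INTEGER REFERENCES tagset(tag_id),
        freq INTEGER
    );
    CREATE INDEX idx_bigram_lemma1 ON bigram (lemma_1, tag_2);
""")

# Invented sample data.
conn.executemany("INSERT INTO tagset VALUES (?, ?)",
                 [(1, "VERB"), (2, "NOUN"), (3, "NOUN")])
conn.executemany("INSERT INTO tag_value VALUES (?, ?, ?)", [
    (1, "tense", "past"), (1, "number", "sg"),
    (2, "case", "acc"), (2, "number", "sg"),
    (3, "case", "gen"), (3, "number", "sg"),
])
conn.executemany("INSERT INTO bigram VALUES (?,?,?,?,?,?,?)", [
    ("читал", "читать", 1, "книгу", "книга", 2, 120),
    ("читал", "читать", 1, "газету", "газета", 2, 40),
    ("читал", "читать", 1, "книги", "книга", 3, 15),
])

def distribution(lemma, pos, category):
    """Counts over the values of `category` in word_2, via two joins."""
    cur = conn.execute("""
        SELECT v.value, SUM(b.freq)
        FROM bigram AS b
        JOIN tagset AS t ON t.tag_id = b.tag_2
        JOIN tag_value AS v ON v.tag_id = b.tag_2
        WHERE b.lemma_1 = ? AND t.pos = ? AND v.category = ?
        GROUP BY v.value
    """, (lemma, pos, category))
    return dict(cur.fetchall())

print(distribution("читать", "NOUN", "case"))
```

The cost of the indirection would have to be measured on the real data; the 
sketch only makes the two extra joins explicit.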

Or we could imagine a binarized representation: category 1 (having n_1 
values) is coded as the first n_1 bits.  If bit k is on, it means that 
category 1 had value k -- of course, these n_1 bits are then mutually 
exclusive (typically); then code the bits for the values of category 2, 
in the next n_2 bits, and so on.  This might be an easy way to encode 
the data, but does SQL support efficient querying of such bit-arrays?
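
The bit-array idea could be sketched like this in Python with SQLite, which 
does provide a bitwise AND operator (&) on integers. The category inventory, 
the bit layout, the table and the sample rows are hypothetical. Note that a 
condition of the form "tags & mask" is computed per row, so an ordinary index 
on the tags column would probably not help with it.

```python
import sqlite3

# Hypothetical inventory: one bit per value, categories laid out in order.
CATEGORIES = {
    "case": ["nom", "gen", "dat", "acc", "ins", "loc"],
    "number": ["sg", "pl"],
}

def build_layout(categories):
    """Map (category, value) to a single bit; category i starts after i-1."""
    layout, offset = {}, 0
    for cat, values in categories.items():
        for k, val in enumerate(values):
            layout[(cat, val)] = 1 << (offset + k)
        offset += len(values)
    return layout

LAYOUT = build_layout(CATEGORIES)

def encode(tags):
    """Pack a dict such as {"case": "acc", "number": "sg"} into an integer."""
    bits = 0
    for cat, val in tags.items():
        bits |= LAYOUT[(cat, val)]
    return bits

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bigram (lemma_1 TEXT, pos_2 TEXT, tags_2 INTEGER, freq INTEGER)"
)
# Invented sample rows.
conn.executemany("INSERT INTO bigram VALUES (?,?,?,?)", [
    ("читать", "NOUN", encode({"case": "acc", "number": "sg"}), 120),
    ("читать", "NOUN", encode({"case": "acc", "number": "pl"}), 40),
    ("читать", "NOUN", encode({"case": "gen", "number": "sg"}), 15),
])

def distribution(lemma, category):
    """Mask out the bits of one category and group by what remains."""
    mask, names = 0, {}
    for (cat, val), bit in LAYOUT.items():
        if cat == category:
            mask |= bit
            names[bit] = val
    cur = conn.execute(
        "SELECT tags_2 & :mask AS bits, SUM(freq) FROM bigram "
        "WHERE lemma_1 = :lemma GROUP BY bits",
        {"mask": mask, "lemma": lemma},
    )
    return {names.get(bits): total for bits, total in cur}

print(distribution("читать", "case"))
```

Since the bits of one category are mutually exclusive, masking and grouping 
yields one group per value of that category.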

We imagine that this problem must have been encountered previously by 
many people working with similar data.  We would be grateful if we could 
learn about your experience, what works and what doesn't.

Pointers to papers/reports describing this are, of course, just as welcome. 
Please reply to me; I will post a summary of replies (if any) to the list.

Best,
Mikhail Kopotev

-- 
Mikhail Kopotev, PhD, Adj.Prof.
University Lecturer
Department of Modern Languages
University of Helsinki
http://www.helsinki.fi/~kopotev




