[Corpora-List] N-grams -> database
Mihail Kopotev
mihail.kopotev at helsinki.fi
Sun Oct 13 06:04:34 UTC 2013
You wrote:
> Hi Mikhail,
>
> On 11/10/13 22:16, Mihail Kopotev wrote:
>> I am wondering if there is a standard way to convert these n-grams into a
>> database?
> not that I'm aware of.
>
>> Technically, there is, of course, no problem to convert, but my question is
>> which indexes should be built and what should be stored as is without
>> any optimization.
>> And more specifically, does it make any sense to keep the whole tagsets,
>> or a better way is to store each tag separately?
> As always with databases, it depends on your application, i.e. on the
> kind of queries you are going to be asking. If you can tell us more
> about that, people may be able to give you concrete suggestions.
>
> Best,
> Kilian
>
Thank you Kilian!
A few words of explanation.
Regarding the data:
- we have a large corpus of N-grams, for N = 2, ..., 6.
- there are millions of N-grams for each N.
- the language is Russian, i.e., it has rich morphology, with a complex
set of morpho-syntactic categories, each category having from 2 to 10
values.
- as mentioned, each N-gram/record in the corpus has the following
information: word-form_1 lemma_1 {POS + morpho-syntactic tags }_1
word-form_2 lemma_2 ...
Regarding the database organization:
- we would like to count a very wide range of co-occurrence statistics. For
example, we might query: for all bi-grams, given that word_1 has
lemma=X, and word_2 has POS=Y, return the distribution of counts over
the values of category C in word_2. The main question is: what is the
optimal schema for the database to support such queries?
What should one N-gram database record look like? For example, for
word_1, we can allocate one field each for: the word-form, the lemma and
the POS. Then what about the morpho-tags? Should we have one field for
each category, to store the value for that category? Since the
categories are different for different parts of speech, does each record
allocate fields for ALL possible categories -- with most values being
NULL? That seems wasteful; will the database explode?
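The one-field-per-category layout could look roughly like the sketch below, using SQLite through Python's standard sqlite3 module. The table name, column names, the three sample categories and the freq column are all hypothetical; a real schema would have one column per category for each of the N positions. How much space the NULL fields actually cost depends on the database engine and would have to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigram (
    form1 TEXT, lemma1 TEXT, pos1 TEXT,
    case1 TEXT, number1 TEXT, tense1 TEXT,   -- one column per category
    form2 TEXT, lemma2 TEXT, pos2 TEXT,
    case2 TEXT, number2 TEXT, tense2 TEXT,   -- NULL where not applicable
    freq INTEGER
);
-- Index on the columns used for filtering in the example query.
CREATE INDEX idx_lemma1_pos2 ON bigram (lemma1, pos2);
""")

# Made-up rows; None becomes NULL for categories a POS does not have.
rows = [
    ("читал", "читать", "V", None, "sg", "past",
     "книгу", "книга", "N", "acc", "sg", None, 12),
    ("читал", "читать", "V", None, "sg", "past",
     "книги", "книга", "N", "acc", "pl", None, 7),
    ("читала", "читать", "V", None, "sg", "past",
     "девочка", "девочка", "N", "nom", "sg", None, 4),
    ("читал", "читать", "V", None, "sg", "past",
     "вслух", "вслух", "ADV", None, None, None, 3),
]
conn.executemany(
    "INSERT INTO bigram VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# Distribution of counts over the values of "case" in word_2,
# given lemma of word_1 and POS of word_2.
wide_result = conn.execute(
    "SELECT case2, SUM(freq) FROM bigram "
    "WHERE lemma1 = ? AND pos2 = ? "
    "GROUP BY case2 ORDER BY case2",
    ("читать", "N"),
).fetchall()
```

In this layout the example query is a plain filter plus GROUP BY on a single table, with no joins.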
Or should the morpho-tags field be a pointer to a record in another
table, which then stores ONLY the appropriate categories (for this POS)
with their values? If so, will the indirection slow down the querying,
and by how much?
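A minimal sketch of this indirection, again with SQLite via Python's sqlite3 module: each distinct tagset is stored once, with only the categories that apply to its POS, and each n-gram position points to it by id. All table and column names, the sample data and the freq column are assumptions for illustration. The cost of the extra joins is not something this sketch can answer; it would have to be measured on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tagset (id INTEGER PRIMARY KEY, pos TEXT);
-- Only the categories relevant for the tagset's POS get a row here.
CREATE TABLE tag_value (
    tagset_id INTEGER REFERENCES tagset(id),
    category  TEXT,
    value     TEXT,
    PRIMARY KEY (tagset_id, category)
);
CREATE TABLE bigram (
    form1 TEXT, lemma1 TEXT, tagset1 INTEGER REFERENCES tagset(id),
    form2 TEXT, lemma2 TEXT, tagset2 INTEGER REFERENCES tagset(id),
    freq INTEGER
);
CREATE INDEX idx_lemma1 ON bigram (lemma1);
""")

conn.executemany("INSERT INTO tagset VALUES (?, ?)",
                 [(1, "V"), (2, "N"), (3, "N"), (4, "ADV")])
conn.executemany("INSERT INTO tag_value VALUES (?, ?, ?)", [
    (1, "tense", "past"), (1, "number", "sg"),
    (2, "case", "acc"), (2, "number", "sg"),
    (3, "case", "acc"), (3, "number", "pl"),
])
conn.executemany("INSERT INTO bigram VALUES (?,?,?,?,?,?,?)", [
    ("читал", "читать", 1, "книгу", "книга", 2, 12),
    ("читал", "читать", 1, "книги", "книга", 3, 7),
    ("читал", "читать", 1, "вслух", "вслух", 4, 3),
])

# Same kind of query as in the text, now with two joins.
query = """
SELECT tv.value, SUM(b.freq)
FROM bigram AS b
JOIN tagset AS t     ON t.id = b.tagset2
JOIN tag_value AS tv ON tv.tagset_id = t.id
WHERE b.lemma1 = ? AND t.pos = ? AND tv.category = ?
GROUP BY tv.value ORDER BY tv.value
"""
norm_result = conn.execute(query, ("читать", "N", "number")).fetchall()
```

If the number of distinct tagsets is small compared to the number of N-grams (an assumption worth checking on the corpus), the two tag tables stay small and the N-gram table only carries integer ids.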
Or we could imagine a binarized representation: category 1 (having n_1
values) is coded as the first n_1 bits. If bit k is on, it means that
category 1 had value k -- of course, these n_1 bits are then mutually
exclusive (typically); then code the bits for the values of category 2,
in the next n_2 bits, and so on. This might be an easy way to encode
the data, but does SQL support efficient querying of such bit-arrays?
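The binarized encoding can be sketched as follows; the category inventory, the bit layout and the table are hypothetical. SQLite (used here through Python's sqlite3 module) provides a bitwise `&` operator, as do several other engines, so such a field can be tested against a mask inside a WHERE clause. Note, as a general caveat, that a predicate of the form `(tags & mask) != 0` typically cannot make use of an ordinary index on the tags column, so it is usually applied as a filter on rows already selected by other, indexed conditions.

```python
import sqlite3

# Hypothetical categories and their values; each value gets one bit.
CATEGORIES = {
    "case": ["nom", "gen", "dat", "acc", "ins", "loc"],
    "number": ["sg", "pl"],
}

# Category i starts right after the bits of categories 1 .. i-1.
OFFSETS = {}
_offset = 0
for _cat, _values in CATEGORIES.items():
    OFFSETS[_cat] = _offset
    _offset += len(_values)


def encode(tags):
    """Turn {category: value} into an integer with one bit set per category."""
    bits = 0
    for cat, value in tags.items():
        bits |= 1 << (OFFSETS[cat] + CATEGORIES[cat].index(value))
    return bits


def mask(cat, value):
    """Bit mask selecting a single value of a single category."""
    return encode({cat: value})


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bigram (lemma1 TEXT, pos2 TEXT, tags2 INTEGER, freq INTEGER)")
conn.executemany("INSERT INTO bigram VALUES (?,?,?,?)", [
    ("читать", "N", encode({"case": "acc", "number": "sg"}), 12),
    ("читать", "N", encode({"case": "acc", "number": "pl"}), 7),
    ("читать", "N", encode({"case": "nom", "number": "sg"}), 4),
])

# Total frequency of bigrams whose word_2 is a noun in the accusative.
acc_total = conn.execute(
    "SELECT SUM(freq) FROM bigram "
    "WHERE lemma1 = ? AND pos2 = ? AND (tags2 & ?) != 0",
    ("читать", "N", mask("case", "acc")),
).fetchone()[0]
```

Getting a full distribution over the values of one category this way needs either one such masked query per value or extra arithmetic to extract the category's bits before grouping.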
We imagine that this problem must have been encountered previously by
many people working with similar data. We would be grateful if we could
learn about your experience, what works and what doesn't.
A pointer to papers/reports describing this is, of course, also welcome.
Please reply to me; I will post a summary of replies (if any) to the list.
Best,
Mikhail Kopotev
--
Mikhail Kopotev, PhD, Adj.Prof.
University Lecturer
Department of Modern Languages
University of Helsinki
http://www.helsinki.fi/~kopotev