[Corpora-List] Are frequency lists of the most languages equivalent?

Alexander Osherenko osherenko at gmx.de
Mon Oct 10 15:24:58 UTC 2011


Maybe a better word than "equivalence" is "adequacy" or "similarity".

I believe there are two types of variability (similarity) we are talking
about: George and Mike would study similarity at the grammatical level, Pete
at the cognitive level. I suppose that each level has its drawbacks :(
Semantic similarities between subjects provide a fascinating basis. However,
some cultures lack particular things and therefore have no word for them.
Grammars, too, can be very different.

Since languages are very different, it is probably not feasible to find a
"universal" frequency list. For this reason, I would simplify the discussion
and limit it to the following question: What properties of two nationalities
can be considered similar enough to entail a similar list of the most
frequent words? The same grammar, realms, etc.? In other words, given
language A and language B, what properties of both languages (both
grammatical and cognitive) influence the list of the most frequent words? I
assume European languages can have similar lists of the most frequent words
because they have very similar realms; their grammars can also be similar.
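
One way to make "a similar list" measurable is to count how many entries of
one top-N list have a counterpart in the other, ignoring rank order. The
following is only a minimal sketch in Python: the function names, the toy word
lists, and the tiny bilingual dictionary are all invented for illustration,
and a one-to-one word mapping is a strong simplifying assumption (it ignores,
for instance, that German has several forms of the definite article).

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent lowercase word tokens, most frequent first."""
    tokens = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def mapped_overlap(list_a, list_b, a_to_b):
    """Share of words in list_a whose translation (per a_to_b) is in list_b.

    Rank order is deliberately ignored: only membership counts.
    a_to_b is a hypothetical one-to-one bilingual dictionary.
    """
    if not list_a:
        return 0.0
    targets = set(list_b)
    hits = sum(1 for word in list_a if a_to_b.get(word) in targets)
    return hits / len(list_a)


# Toy data, invented for illustration only (not taken from any real list).
english_top = ["the", "of", "and", "to"]
german_top = ["der", "die", "und", "in"]
toy_dictionary = {"the": "der", "of": "von", "and": "und", "to": "zu"}

share = mapped_overlap(english_top, german_top, toy_dictionary)
print(f"Share of English entries with a German counterpart: {share:.0%}")
```

With real data, the lists would come from corpora such as those discussed in
this thread, and the dictionary would need far more care than a one-to-one
lookup.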

Marvelous examples would be Eastern Germany vs. Western Germany (both
speaking the same language but having different realms) and American English
vs. British English. As Georgios said, temporality plays a minor role in this
discussion. How about geography? Is the list of the most frequent words the
same at both borders of the same country?

Alexander

2011/10/10 Georgios Mikros <gmikros at isll.uoa.gr>

> Dear Alexander,
>
> The 1000 most frequent words of most languages are mainly function words,
> and their frequency distribution can be predicted with reasonable accuracy
> using Zipf's law. In a number of experiments we conducted in the early
> 2000s for Modern Greek [1], we found that 90% of the 1000 most frequent
> words do not change even when we roughly triple the size of the corpus (from
> 13 million to 33 million words) and considerably change its topic and genre
> structure. So we are probably dealing with a lexical core which, due to the
> grammatical character of its constituents (function words), should be
> similar across most languages.
>
> Best
>
> George Mikros
>
> [1] Mikros, G., Hatzigeorgiu, N., & Carayannis, G. (2005). Basic
> quantitative characteristics of the Modern Greek Language using the Hellenic
> National Corpus. Journal of Quantitative Linguistics, 12(2-3), 167-184.
> doi:10.1080/09296170500172478
>
> ____________________________
> George K. Mikros
> Associate Professor of Computational and Quantitative Linguistics
> Department of Italian Language and Literature
> School of Philosophy
> National and Kapodistrian University of Athens
> Panepistimioupoli Zografou, GR-15784
> Athens, Greece
> Tel: +30 210 7277491, +30 6976111742
> Email: gmikros at isll.uoa.gr
> Web: http://users.uoa.gr/~gmikros/
>
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
> Of Alexander Osherenko
> Sent: Monday, October 10, 2011 2:23 PM
> To: corpora at uib.no
> Subject: [Corpora-List] Are frequency lists of the most languages
> equivalent?
>
> Hi all,
>
> I am wondering if the frequency lists of most languages can be considered
> equivalent. For instance, consider an English frequency list such as the
> BNC frequency list (http://www.kilgarriff.co.uk/bnc-readme.html)
> and a German frequency list (http://german.about.com/library/blwfreq01.htm).
> The English frequency list starts with the definite article "the". The
> German one starts with the definite article "der". Hence, the literal
> translation of the word "the" into German yields the word "der".
>
> Of course, it is not always enough to translate directly. However, I
> would not be surprised if, say, 70%-80% of the most frequent words in most
> languages could be considered equivalent. Note that I am not saying the
> words must also be ordered in the same manner, for example that the word
> "of" always comes before the word "appear". Rather, I anticipate that the
> words "of" and "appear" are present among the most frequent words of most
> languages in whatever order, even if a particular language uses the word
> "appear" more often than the word "of".
>
> Alexander
>