[Corpora-List] Are frequency lists of the most languages equivalent?

Alexander Osherenko osherenko at gmx.de
Tue Oct 11 15:09:32 UTC 2011


Exactly! Why do you say you don't understand what I'm talking about?

I wonder what factors can influence this similarity. For example, I supposed
that, besides grammar, demographics do, but there are thousands of indicators
(http://data.worldbank.org/indicator). Maybe somebody has already studied
this issue.
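
A minimal sketch of how John's operationalization (quoted below) might be
computed, assuming a word-to-count frequency list for each language and a
bilingual dictionary are available. The function names, the toy counts, and the
toy dictionary are made up for illustration and are not real corpus data. The
sketch covers only the type-level part (which top-k words have a translation
in the other top-k), not the "90% of the time" token-level part.

```python
from collections import Counter


def top_k(freqs, k):
    """Return the set of the k most frequent words in a word->count mapping."""
    return {word for word, _ in Counter(freqs).most_common(k)}


def translated_topk_overlap(freqs_a, freqs_b, translations, k):
    """Fraction of A's top-k words with at least one translation in B's top-k.

    `translations` maps a word of language A to a set of candidate words in
    language B (a hypothetical bilingual dictionary supplied by the caller).
    """
    top_a = top_k(freqs_a, k)
    top_b = top_k(freqs_b, k)
    if not top_a:
        return 0.0
    hits = sum(1 for word in top_a if translations.get(word, set()) & top_b)
    return hits / len(top_a)


# Toy, made-up counts purely for illustration.
english = {"the": 100, "of": 60, "and": 50, "appear": 5}
german = {"der": 90, "und": 55, "von": 40, "erscheinen": 4}
en_to_de = {
    "the": {"der", "die", "das"},
    "of": {"von"},
    "and": {"und"},
    "appear": {"erscheinen"},
}

print(translated_topk_overlap(english, german, en_to_de, k=3))
```

The measure is asymmetric (A into B), so it would have to be computed in both
directions to compare two languages fairly.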

2011/10/11 John D Burger <john at mitre.org>

> I for one still do not know what you are talking about.  What do you mean
> by similar?  Can you operationalize this?  Do you mean something like:
>
>  90% of the words in language A's top 1000 by frequency
>  will be translated as one of language B's top 1000,
>  90% of the time.
>
> - John Burger
>  MITRE
>
>
> On Oct 10, 2011, at 11:24 , Alexander Osherenko wrote:
>
> > Maybe a better word for "equivalence" is "adequateness" or "similarity".
> >
> > I believe there are two types of variability (similarity) we are talking
> > about: George and Mike would study similarity at the grammatical level;
> > Pete at the cognitive level. I suppose that every particular level has its
> > drawbacks :( Semantic similarities between subjects provide a fascinating
> > basis. However, some cultures do not have particular things and therefore
> > have no word for them. Grammars can also be very different.
> >
> > Since languages are very different, it is probably not feasible to find a
> > "universal" frequency list. For this reason, I would simplify the
> > discussion and limit it to the following question: What properties of two
> > nationalities can be considered similar enough to entail a similar list of
> > the most frequent words? The same grammar, realms, etc.? In other words,
> > given language A and language B, what properties of both languages (both
> > grammatical and cognitive) influence the list of the most frequent words?
> > I assume European languages can have similar lists of the most frequent
> > words because they have very similar realms; their grammars can also be
> > similar.
> >
> > Marvelous examples can be Eastern Germany vs. Western Germany (both
> > speaking the same language but having different realms) and American
> > English vs. British English. As Georgios said, temporality plays a minor
> > role in this discussion. How about geography? Is the list of frequent
> > words the same at both borders of the same country?
> >
> > Alexander
> >
> > 2011/10/10 Georgios Mikros <gmikros at isll.uoa.gr>
> > Dear Alexander,
> >
> > The 1000 most frequent words of most languages are mainly function words,
> > and their frequency distribution can be predicted with reasonable accuracy
> > using Zipf's law. In a number of experiments we conducted in the early
> > 2000s for Modern Greek [1], we found that 90% of the 1000 most frequent
> > words do not change even when we triple the size of the corpus (from
> > 13M words to 33M words) and considerably change its topic and genre
> > structure. So we are probably dealing with a lexical core which, due to
> > the grammatical character of its constituents (function words), should be
> > similar across most languages.
> >
> > Best
> >
> > George Mikros
> >
> >
> >
> > [1] Mikros, G., Hatzigeorgiu, N., & Carayannis, G. (2005). Basic
> > quantitative characteristics of the Modern Greek Language using the
> > Hellenic National Corpus. Journal of Quantitative Linguistics, 12(2-3),
> > 167-184. doi: 10.1080/09296170500172478
> >
> >
> >
> > ____________________________
> > George K. Mikros
> > Associate Professor of Computational and Quantitative Linguistics
> > Department of Italian Language and Literature
> > School of Philosophy
> > National and Kapodistrian University of Athens
> > Panepistimioupoli Zografou, GR-15784
> > Athens, Greece
> > Tel: +30 210 7277491, +30 6976111742
> > Email: gmikros at isll.uoa.gr
> > Web: http://users.uoa.gr/~gmikros/
> >
> >
> > From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Alexander Osherenko
> > Sent: Monday, October 10, 2011 2:23 PM
> > To: corpora at uib.no
> > Subject: [Corpora-List] Are frequency lists of the most languages equivalent?
> >
> > Hi all,
> >
> > I am wondering if the frequency lists of most languages can be considered
> > equivalent. For instance, consider an English frequency list such as the
> > BNC frequency list (http://www.kilgarriff.co.uk/bnc-readme.html) and a
> > German frequency list (http://german.about.com/library/blwfreq01.htm). The
> > English frequency list starts with the definite article "the". The German
> > one starts with the definite article "der". Hence, the literal translation
> > of the word "the" into German results in the word "der".
> >
> > Of course, it is not always enough to translate directly. However, I
> > wouldn't be surprised if, say, 70%-80% of the most frequent words in most
> > languages can be considered equivalent. Notice that I don't say the words
> > should also be ordered in the same manner, for example, that the word "of"
> > always comes before the word "appear". Nevertheless, I anticipate that the
> > words "of" and "appear" are present among the most frequent words of most
> > languages, in whatever order, even if a particular language uses the word
> > "appear" more often than the word "of".
> >
> > Alexander
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>
>
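
George's stability observation quoted above (about 90% of the 1000 most
frequent words staying the same when the corpus grows) could be checked on
other corpora along these lines. This is only a sketch under assumptions:
the function names are hypothetical, tokenization is plain whitespace
splitting, and the toy sentences are made up rather than taken from any real
corpus.

```python
from collections import Counter


def top_k_words(tokens, k):
    """Return the set of the k most frequent tokens in a token sequence."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def top_k_stability(small_tokens, large_tokens, k):
    """Share of the smaller corpus's top-k words still in the larger corpus's top-k."""
    small_top = top_k_words(small_tokens, k)
    large_top = top_k_words(large_tokens, k)
    if not small_top:
        return 0.0
    return len(small_top & large_top) / len(small_top)


# Toy, made-up token streams purely for illustration.
small = "the cat and the dog and the bird".split()
large = small + "the fish of the sea of the river of the lake of".split()

print(top_k_stability(small, large, k=2))
```

With real data one would pass the token streams of the smaller and the
enlarged corpus and set k=1000; ties at the cut-off rank are broken by first
occurrence here, which may matter for small corpora.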