[Corpora-List] Are frequency lists of the most languages equivalent?

Wed Oct 12 17:15:49 UTC 2011

Hi Alexander
I think you may be underestimating the complexity of the problems involved in comparing frequency lists:
a) no corpus can be truly representative of any language; corpora are tiny samples
b) the corpora you are comparing would need to be reasonably similar in terms of size, contents, variety, vintage, etc
- there are very few such corpora publicly available, so you would probably have to create them;
you might find the BYU corpora - http://corpus.byu.edu/ - a useful place to start
c) the top 1000 words would not give you sufficient content words to make the kind of statements (about
level of development, education etc) that you aspire to; perhaps you might glance at
http://acorn.aston.ac.uk/SummerSchool2011/001-ramesh-sheffield-workshop2002.pdf
for some issues that would arise from just one corpus (the Bank of English): tokenization, lemmatization, neologisms, etc
d) unfortunately, frequency lists are not always publicly available, even for publicly available corpora
e) different corpus software will yield different frequency counts (dependent on tokenisation)
f) and of course you would need to be a reasonably expert user of each of the languages you are comparing
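
To illustrate point (e), here is a minimal Python sketch. It is not taken from any particular corpus tool; the sample sentence and both tokenizer functions are made up for illustration. It shows how two plausible tokenisation rules can give different frequency counts for the same text:

```python
import re
from collections import Counter

# Invented sample text, chosen because it contains apostrophes and hyphens.
text = "Don't over-think it: the state-of-the-art isn't the point, the data is."

PUNCT = ".,:;!?\"'"

def whitespace_tokens(s):
    # Rule 1 (assumed): split on whitespace, strip edge punctuation, lowercase.
    stripped = (t.strip(PUNCT).lower() for t in s.split())
    return [t for t in stripped if t]

def letter_run_tokens(s):
    # Rule 2 (assumed): every run of letters is a token, so apostrophes
    # and hyphens split words apart.
    return re.findall(r"[a-z]+", s.lower())

counts_ws = Counter(whitespace_tokens(text))
counts_re = Counter(letter_run_tokens(text))

print(counts_ws.most_common(3))
print(counts_re.most_common(3))
```

Under rule 1 "state-of-the-art" is a single token, while under rule 2 it contributes one extra "the" and an "of", so even the count of the most frequent word differs between the two lists.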
best
Ramesh Krishnamurthy
Visiting Academic Fellow, School of Languages and Social Sciences, Aston University, Birmingham B4 7ET
Room: NX01. Tel: 0121-204-3812.
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
Corpus Analyst:
(a) GeWiss (Volkswagen Foundation) project: http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/
(b) Discourse of Climate Change: http://www1.aston.ac.uk/lss/research/research-projects/discourse-of-climate-change-project/
(c) Feminism: http://acorn.aston.ac.uk/projects.html
(d) COMENEGO (Corpus Multilingüe de Economía y Negocios) - Multilingual Corpus of Business and Economics: http://dti.ua.es/comenego
(e) European Phraseology Project: http://labidiomas3.ua.es/phraseology/login/login.php
-----------------------------

Message: 6

Date: Tue, 11 Oct 2011 17:37:53 +0200
From: Alexander Osherenko <osherenko at gmx.de>
Subject: Re: [Corpora-List] Are frequency lists of the most languages equivalent?
To: John D Burger <john at mitre.org>
Cc: corpora <corpora at uib.no>

Or an additional example:

If A's level of industrialization is much higher than B's, you would find words such as car, aircraft, etc. in A's top-1000 list but not in B's, because B's society is not familiar with such things. I assume the same applies to the educational level of A and B.
--------------------------------

Message: 5

Date: Tue, 11 Oct 2011 17:09:32 +0200
From: Alexander Osherenko <osherenko at gmx.de>
Subject: Re: [Corpora-List] Are frequency lists of the most languages equivalent?
To: John D Burger <john at mitre.org>
Cc: corpora <corpora at uib.no>

Exactly! Why do you say you don't understand what I'm talking about?

I wonder what factors can influence this similarity. For example, I supposed that, besides grammar, demographics do, but there are thousands of indicators (http://data.worldbank.org/indicator). Maybe somebody has already studied this issue.
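
The operationalization John proposes (quoted below) could be sketched roughly as follows. This is only a simplified, type-level reading of it: it ignores the "90% of the time" token-level condition, and the function name, the word lists, and the translation dictionary are all invented toy data, not real frequency lists.

```python
def translated_overlap(top_a, top_b, translations):
    """Share of words in top_a having at least one translation in top_b.

    top_a, top_b: lists of the most frequent words of languages A and B.
    translations: dict mapping a word of A to its candidate translations
                  in B (hypothetical bilingual dictionary).
    """
    if not top_a:
        return 0.0
    top_b_set = set(top_b)
    hits = [w for w in top_a if top_b_set & set(translations.get(w, ()))]
    return len(hits) / len(top_a)

# Toy data for illustration only.
top_en = ["the", "of", "and", "to", "car"]
top_de = ["der", "die", "und", "in", "zu"]
en_to_de = {
    "the": ["der", "die", "das"],
    "of": ["von"],
    "and": ["und"],
    "to": ["zu", "nach"],
    "car": ["Auto", "Wagen"],
}

score = translated_overlap(top_en, top_de, en_to_de)
print(f"{score:.0%} of the English list is covered")
```

With real data, the lists would be the top 1000 entries of each frequency list, and the result would depend heavily on the dictionary and on how each list was tokenised and lemmatised.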

2011/10/11 John D Burger <john at mitre.org>

> I for one still do not know what you are talking about.  What do you
> mean by similar?  Can you operationalize this?  Do you mean something like:
>
>  90% of the words in language A's top 1000 by frequency will be
>  translated as one of language B's top 1000, 90% of the time.
>
> - John Burger
>  MITRE
>
> On Oct 10, 2011, at 11:24 , Alexander Osherenko wrote:
>

> > Maybe the better word for "equivalence" is "adequateness" or "similarity".
> >
> > I believe there are two types of variability (similarity) we are talking
> > about: George and Mike would study similarity at the grammatical level;
> > Pete at the cognitive level. I suppose that every particular level has its
> > drawbacks :( Semantic similarities between subjects provide a fascinating
> > basis. However, some cultures do not have particular things and therefore
> > have no word for them. Grammars can be very different.
> >
> > Since languages are very different, it is probably not feasible to find a
> > "universal" frequency list. For this reason, I would simplify the
> > discussion and limit it to the following question: What properties of two
> > nationalities can be considered similar enough to entail a similar list of
> > the most frequent words? The same grammar, realms, etc.? In other words,
> > given language A and language B, what properties of both languages (both
> > grammatical and cognitive) influence the list of the most frequent words?
> > I assume European languages can have similar lists of the most frequent
> > words because they have very similar realms; their grammars can also be
> > similar.
> >
> > Marvelous examples would be Eastern Germany vs. Western Germany (both
> > speaking the same language but having different realms) and American
> > English vs. British English. As Georgios said, temporality plays a minor
> > role in this discussion. How about geography? Is the list of frequent
> > words the same at both borders of the same country?
> >
> > Alexander

> >

> > 2011/10/10 Georgios Mikros <gmikros at isll.uoa.gr>

> >

> > Dear Alexander,

> >

> > The 1000 most frequent words of most languages are mainly function words,
> > and their frequency distribution can be predicted with reasonable accuracy
> > using Zipf's law. In a number of experiments we conducted in the early
> > '00s for Modern Greek [1], we found that 90% of the 1000 most frequent
> > words do not change even when we triple the size of the corpus (from
> > 13 Mwords to 33 Mwords) and considerably change its topic and genre
> > structure. So we are probably dealing with a lexical core which, due to
> > the grammatical character of its constituents (function words), should be
> > similar across most languages.
> >
> > Best
> >
> > George Mikros
> >
> > [1] Mikros, G., Hatzigeorgiu, N., & Carayannis, G. (2005). Basic
> > quantitative characteristics of the Modern Greek Language using the
> > Hellenic National Corpus. Journal of Quantitative Linguistics, 12(2-3),
> > 167-184. doi: 10.1080/09296170500172478

> >

> >

> >

> > ____________________________
> >
> > George K. Mikros
> > Associate Professor of Computational and Quantitative Linguistics
> > Department of Italian Language and Literature
> > School of Philosophy
> > National and Kapodistrian University of Athens
> > Panepistimioupoli Zografou, GR-15784
> > Athens, Greece
> > Tel: +30 210 7277491, +30 6976111742
> > Email: gmikros at isll.uoa.gr
> > Web: http://users.uoa.gr/~gmikros/

> >

> >

> >

> > From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Alexander Osherenko
> > Sent: Monday, October 10, 2011 2:23 PM
> > To: corpora at uib.no
> > Subject: [Corpora-List] Are frequency lists of the most languages equivalent?

> >

> >

> >

> > Hi all,
> >
> > I am wondering if the frequency lists of most languages can be considered
> > equivalent. For instance, consider an English frequency list such as the
> > BNC frequency list (http://www.kilgarriff.co.uk/bnc-readme.html) and a
> > German frequency list (http://german.about.com/library/blwfreq01.htm). The
> > English frequency list starts with the definite article "the", the German
> > one with the definite article "der". Hence, the literal translation of the
> > word "the" into German yields the word "der".
> >
> > Of course, it is not always enough to translate directly. However, I
> > would not be surprised if, say, 70%-80% of the most frequent words in most
> > languages can be considered equal. Notice I don't say the words should
> > also be ordered in the same manner. For example, I do not claim that the
> > word "of" always comes before the word "appear". Nevertheless, I
> > anticipate that the words "of" and "appear" are present among the most
> > frequent words of most languages, in whatever order, even if a particular
> > language uses the word "appear" more often than the word "of".
> >
> > Alexander
