[Corpora-List] Are frequency lists of the most languages equivalent?

Alexander Osherenko osherenko at gmx.de
Thu Oct 13 15:36:40 UTC 2011


Hi Ramesh,

I'm afraid we are still talking about different things. I am talking about
the similarity of frequency lists (whatever their source) and about which
properties of their languages, for example geopolitical ones, influence word
choice. You are talking about the difficulties of collecting corpora, which
I don't dispute, but which are irrelevant to my question.

However, I can comment on your list to clarify the problem I am discussing.
a) "corpora are tiny samples" yes, they are. However, both the BNC frequency
list and, for example, the Weka stopwords list contain the word "the".
Intuitively, I would also include "the", and you probably would too. What is
the reason that some words are included in a frequency list and others are
not? Is the reason purely grammatical, or are there other reasons as well,
for example sociological ones?
b) "the corpora you are comparing...". As I said, I am not comparing
corpora. I am analyzing frequency lists of languages. I simply ignore the
source of the lists to simplify the discussion.
c) "the top 1000 words would not give you sufficient content words..." - the
number is irrelevant. I could have taken 10,000 or 100,000. In my case, I
wonder which factors shape the particular composition of a frequency list.
You mention content words, which implies that you see grammar as one such
factor. However, I assume that grammar is not the only one and that there
are other factors as well, such as demographic ones.
d) "unfortunately, frequency lists are not always publicly available..."
yes, that is unfortunate. That is why I simplify the discussion and study
ready-made frequency lists rather than their origin.
e) "different corpus software will yield..." yes. However, I assume that
whatever method is chosen for calculating frequency lists, the main
conclusions can still be drawn: intuitively, the word "the" will always be
present in a frequency list. Otherwise, such a frequency list can't be
considered trustworthy.
f) "and of course you would need to be a reasonably expert user" yes, that
would be nice. However, to begin with we can hypothesize about possible
reasons for similarities even without such knowledge :)
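
To make point a) concrete, here is a minimal sketch of one way the
"similarity of frequency lists" could be measured, assuming plain set
overlap (Jaccard) over the top-N words of two lists in the same language.
The function names and the two toy lists are invented for illustration;
they are not the real BNC or Weka lists, and the tokenizer is deliberately
naive.

```python
from collections import Counter
import re


def frequency_list(text, top_n=1000):
    """Return the top_n most frequent word tokens in text, most frequent first."""
    # Naive tokenization: lowercase runs of letters. A different tokenizer
    # would give different counts, as point e) notes.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def jaccard_overlap(list_a, list_b):
    """Share of distinct words that occur in both lists (0.0 to 1.0)."""
    set_a, set_b = set(list_a), set(list_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Invented toy lists, NOT the real BNC or Weka lists.
list_a = ["the", "of", "and", "to", "parliament"]
list_b = ["the", "of", "and", "a", "in"]

shared = sorted(set(list_a) & set(list_b))
score = jaccard_overlap(list_a, list_b)
print(shared, round(score, 2))
```

Under this measure, the words outside the shared set are the ones whose
presence would need an explanation other than grammar. Comparing lists of
different languages would additionally require a mapping between
translation equivalents, which this sketch does not attempt.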

A practical example. You have a text in language C and you want to find out
which languages are suitable destination languages for translating it.
You have two sets: a set of words from geographical region A and a set
of words from geographical region B. These sets PERFECTLY represent the
frequency lists of A and B. For instance, you want to translate a text
originally written in German and have to decide which language is most
appropriate for the translation. You have to choose between two destination
languages: English (A) or a language of Indigenous peoples in Brazil
(B). What would be the reasons for your choice?

You will probably translate the German text into English for numerous
reasons, for example because 1) the grammar of the two languages is similar;
2) both countries have the same political organization: both England and
Germany are governed by a parliament (even if England is formally a
monarchy); 3) both countries are EU members; etc. Hence, you would make life
easier for yourself and never have trouble choosing appropriate words such
as car or airport: there are cars in both Germany and England, there are
airports in both Germany and England, etc. In contrast, if you chose a
language of Indigenous peoples in Brazil as the destination language, you
would have trouble explaining basic notions. For example, what word would
you use in that language for federal republic, or for monarchy? Conclusion:
a German text is better translated into English, and a text in a language of
Indigenous peoples in Brazil is better translated into another language of
Indigenous peoples in Brazil.

I had hoped that somebody had already answered my question or could
recommend something on it, because I want to find all indicators of
influence and not just the most evident ones that I mentioned (grammar,
political organization). Can you follow? Anyway, sooner or later (probably
after collecting corpora ;-) ) this question will arise.

Best
Alexander

2011/10/12 Krishnamurthy, Ramesh <r.krishnamurthy at aston.ac.uk>

> Hi Alexander
>
> I think you may be underestimating the complexity of the problems involved
> in comparing frequency lists:
>
> a) no corpus can be truly representative of any language; corpora are tiny
> samples
>
> b) the corpora you are comparing would need to be reasonably similar in
> terms of size, contents, variety, vintage, etc
> - there are very few such corpora publicly available, so you would probably
> have to create them;
> you might find the BYU corpora - http://corpus.byu.edu/ - a useful place
> to start
>
> c) the top 1000 words would not give you sufficient content words to make
> the kind of statements (about level of development, education etc) that you
> aspire to; perhaps you might glance at
> http://acorn.aston.ac.uk/SummerSchool2011/001-ramesh-sheffield-workshop2002.pdf
> for some issues that would arise from just one corpus (the Bank of
> English): tokenization, lemmatization, neologisms, etc
>
> d) unfortunately, frequency lists are not always publicly available, even
> for publicly available corpora
>
> e) different corpus software will yield different frequency counts
> (dependent on tokenisation)
>
> f) and of course you would need to be a reasonably expert user of each of
> the languages you are comparing
>
> best
>
> Ramesh Krishnamurthy
>
> Visiting Academic Fellow, School of Languages and Social Sciences, Aston
> University, Birmingham B4 7ET
> Room: NX01. Tel: 0121-204-3812.
> Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/
>
> Corpus Analyst:
> (a) GeWiss (Volkswagen Foundation) project:
> http://www1.aston.ac.uk/lss/research/research-projects/gewiss-spoken-academic-discourse/
> (b) Discourse of Climate Change:
> http://www1.aston.ac.uk/lss/research/research-projects/discourse-of-climate-change-project/
> (c) Feminism: http://acorn.aston.ac.uk/projects.html
> (d) COMENEGO (Corpus Multilingüe de Economía y Negocios) - Multilingual
> Corpus of Business and Economics: http://dti.ua.es/comenego
> (e) European Phraseology Project:
> http://labidiomas3.ua.es/phraseology/login/login.php