Corpora: Relative text length
Yorick Wilks
yorick at dcs.shef.ac.uk
Mon Apr 29 14:36:06 UTC 2002
Sorry, I don't quite follow this. I thought the original question was
just about length (whether text, characters, morphemes or words), and I
didn't know when reading the question what the questioner's purpose
was. I HOPE it wasn't language discrimination, because Ramesh's figures
show pretty clearly that length (as words) doesn't separate Slavic
languages like Czech from Estonian/Hungarian. Length as characters does
a bit better, although there's no separation from the Slavic family as
a whole at all!
None of that seems terribly simplistic, just natural, given the question
and answer (though unhelpful, as it turns out).
What I don't follow is the link to alignment that you make. Alignment is
clearly interesting, but what does it, or can it, say about the relative
length of languages that the simpler counts do not?
What is this 'other direction' you write of? Is it that, if you align
at the sentence level many-one, it says something about some property
of the languages that can distinguish them?
Or won't all that depend on the existence and shared significance of
punctuation marks, which seems a bit implausible?
Regards
Yorick Wilks
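
As a rough check on the words-versus-characters point above, the sketch below takes the 1984 word and character totals from Ramesh's table quoted further down and expresses each translation relative to the English original. The dictionary layout, the function name `relative_lengths` and the chars-per-word column are illustrative choices, not anything the original posters used.

```python
# Word and character totals for Orwell's 1984, copied from the table
# quoted later in this thread (Multext-East figures via Tomaz Erjavec).
COUNTS_1984 = {
    "English":   (102787, 584803),
    "Bulgarian": (85878, 536977),
    "Czech":     (79022, 498216),
    "Estonian":  (78792, 545984),
    "Hungarian": (79814, 575219),
    "Romanian":  (103704, 603868),
    "Slovene":   (91336, 541461),
}


def relative_lengths(counts, baseline="English"):
    """Return {language: (word_ratio, char_ratio, chars_per_word)}.

    Both ratios are relative to the baseline language. chars_per_word is
    simply characters / words, so it includes spaces and punctuation
    (an assumption about how the 'char' column was counted).
    """
    base_words, base_chars = counts[baseline]
    result = {}
    for language, (words, chars) in counts.items():
        result[language] = (words / base_words, chars / base_chars, chars / words)
    return result


if __name__ == "__main__":
    for language, (w, c, cpw) in relative_lengths(COUNTS_1984).items():
        print(f"{language:<10} words {w:5.2f}  chars {c:5.2f}  chars/word {cpw:5.2f}")
```

On these figures Czech, Estonian and Hungarian all come out at between 0.76 and 0.78 of the English word count, while their character ratios spread out to about 0.85, 0.93 and 0.98. Bulgarian and Slovene, at about 0.92 and 0.93, sit right next to Estonian.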
Alex Chengyu Fang wrote:
> Which measure to use depends on the purpose of the
> study, whether to bring out differences or
> similarities of the languages concerned.
>
> A rather simplistic view is that counts of words,
> characters, syllables, morphemes etc tend to be used
> to discriminate between languages. An attempt in the
> other direction is the use of the number of
> propositions to, for instance, automatically align
> multilingual texts:
>
> Campbell, J. and A.C. Fang. 1995. Automated Alignment
> in Multilingual Corpora. In Proceedings of the 10th
> Pacific Asia Conference on Language, Information and
> Computation (PACLIC10), 27-28 December 1995, Hong Kong
> City University, Hong Kong. pp 185-193.
>
> Regards,
>
> Alex Fang
>
> --- ramesh at ccl.bham.ac.uk wrote:
> > Dear Yorick
> >
> > Would morpheme counts not be even more accurate (or
> > linguistically valid) than counting orthographic
> > characters?
> > Unfortunately, I don't think anyone has done these
> > yet...
> >
> > Anyway, I agree that for the moment, character counts
> > are a useful addition to word counts.
> >
> > Problems about translation (compensation, explication,
> > zero translation, etc) obviously apply throughout.
> >
> > Here are some figures from my own research:
> >
> > 1. FIFA Laws in English, German, Spanish, and French.
> > French is longest, then Spanish, German, and English.
> >
> >   lines   words   characters   text
> >
> >     726   10216        56874   Laws97GB.txt
> >     724    9173        63402   Laws97DE.txt
> >    1342   11030        63765   Laws97SP.txt
> >    1169   11763        67537   Laws97FR.txt
> >
> > 2. Canadian Hansard in English and French.
> > French is longer in both samples.
> >
> >   lines   words    chars   text
> >
> >    1569   20336   104015   c1.001.E.A
> >    1569   22413   124457   c1.002.F.A
> >
> >    1120   12260    62421   c2.002.E.A
> >    1120   12135    62622   c2.003.F.A
> >
> > 3. George Orwell's 1984 (thanks to Multext-East and TELRI)
> > in several languages. These figures were provided by
> > Dr Tomaz Erjavec (Ljubljana) with various additional
> > caveats:
> >
> >               line     word     char
> >
> > English      16053   102787   584803
> > Bulgarian    11172    85878   536977
> > Czech        11087    79022   498216
> > Estonian     17872    78792   545984
> > Hungarian     8813    79814   575219
> > Romanian     16684   103704   603868
> > Slovene      14938    91336   541461
> >
> > 4. Le Monde Diplomatique in English and French:
> >
> >   lines   words   characters   text
> >
> >     116     956         6410   LEMAE1.txt
> >     133     941         7457   LEMAF1.txt
> >
> > 5. From research with Dr Maria Cristina Borba (Rio Grande, Brazil).
> > Alice in Wonderland in English, 2 Brazilian-Portuguese translations
> > (one for adults, one for children), and a Catalan translation (MARIST).
> >
> >                              CARROLL    LEITE  SEVCENKO   MARIST
> >
> > File length (bytes)          204,288  148,889   150,235  143,055
> > Running words (tokens)        31,731   25,348    26,245   25,566
> > Different words (types)        3,417    3,896     3,614    4,400
> > type/token ratio (mean)       44.99%   51.61%    51.25%   51.19%
> > ave. word length (letters)      3.63     4.36      4.31     4.16
> >
> > Best
> > Ramesh
> >
> > Ramesh Krishnamurthy
> > Honorary Research Fellow, University of Birmingham;
> > Honorary Research Fellow, University of Wolverhampton;
> > Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries.
> >
> >
> > On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick Wilks wrote:
> > >
> > > Isn't there some (minor) confusion here? If the question really is
> > > relative TEXT length, then nothing to do with word counts will settle
> > > it--what matters is character counts, since word length varies
> > > considerably between languages. The table showed 1984 in Estonian as
> > > having far fewer word tokens in it than the English original, but I'd
> > > bet they're much longer ones--how about the texts then??
> > > I have no parallel texts with English and E. European languages but I
> > > do with the four major W. European ones and the English pages are
> > > shorter in every case.
> > > Yorick Wilks
> > >
> > > "James L. Fidelholtz" wrote:
> > >
> > > > Andrew and Spela:
> > > > Just a word of caution: studies like Spela's provide interesting
> > > > and suggestive data, but figures will surely vary, depending on the
> > > > translator, topic, etc. [all the usual sociolinguistic caveats apply
> > > > here] (and note Jean's contribution, with varying rates). I was
> > > > coauthor of a study comparing English and Spanish, which basically
> > > > tried to get Spanish to fit into the standard readability curves in
> > > > a fairly simple way. We were only partially successful (the counts
> > > > were hand-done by yours truly, featuring a variety of types of text,
> > > > pseudo-randomly sampled, and especially translations from one
> > > > language to the other, as well as translations from 3rd languages
> > > > [French & German] into each). To the best of my recollection (I
> > > > could look up the exact figures if anyone is hot for them), our
> > > > results for Spanish-English were rather close to Jean's for French
> > > > (I assume his were on large amounts of text done by computer--if
> > > > this holds up [not surprising, given the close relationship of
> > > > French and Spanish], it may indicate that, for this kind of data,
> > > > not such a huge amount of text is really necessary).
> > > >
> > > > On Wed, 24 Apr 2002, spela vintar wrote:
> > > >
> > > > >
> > > > >Hi Andrew,
> > > > >
> > > > >for Eastern-European languages you can compare the lengths of
> > > > >Orwell's 1984 and its translations that were collected within the
> > > > >Multext-East project. The original Multext project
> > > > >(http://www.lpl.univ-aix.fr/projects/multext/) should provide the
> > > > >same for English, German, French, Spanish etc., however I wasn't
> > > > >able to find it on their homepage at first glance...
> > > > >
> > > > >Best,
> > > > >Spela
> > > > >
> > > > >http://nl.ijs.si/ME/CD/docs/mte-d21f/node8.html
> > > > >//////////////
> > > > >...
> > > > >Below we give an estimate for the number of words, by language. The
> > > > >wordcounts were produced by removing the SGML tags from the texts
> > > > >and then using a 'wc'-like procedure.
> >
> === message truncated ===
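
The 'wc'-like procedure described in the Multext-East note quoted above can be sketched roughly as follows. This is only a guess at what such a procedure might look like, not the project's actual script: the regular expression, the function name `wc_without_tags` and the sample string are all illustrative assumptions.

```python
import re

# Naive tag pattern: anything between '<' and the next '>'. Real SGML
# (comments, marked sections, entity references) needs a proper parser.
TAG_RE = re.compile(r"<[^>]*>")


def wc_without_tags(text):
    """Return (lines, words, chars) for text after stripping tags.

    Mirrors the three default counts of `wc`, except that chars counts
    Unicode characters (like `wc -m`) rather than bytes.
    """
    stripped = TAG_RE.sub("", text)
    lines = stripped.count("\n")   # wc counts newline characters
    words = len(stripped.split())  # whitespace-delimited tokens
    chars = len(stripped)          # characters, including whitespace
    return lines, words, chars


if __name__ == "__main__":
    # Hypothetical one-sentence sample with made-up markup.
    sample = "<p><s>It was a bright cold day in April.</s></p>\n"
    print(wc_without_tags(sample))
```

Counting whitespace-delimited tokens, as `wc` does, leaves punctuation attached to words, which is one reason such figures are best read as estimates.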