Corpora: Relative text length

Alex Chengyu Fang alex_chengyu at yahoo.co.uk
Mon Apr 29 15:34:00 UTC 2002


What I wanted to say is that there are different ways
of measuring relative length and that, if counts of
characters, syllables and morphemes are used, you are
likely to see differences between language pairs.
If, however, the semantic proposition is used as the
unit, languages may not be so different, as the number
of propositions should be nearly constant across
multilingual texts that are translations of each other.

So, my simplistic view is that to see the differences,
use characters, syllables and morphemes as
measurements. To see similarities (the other
direction), the number of semantic propositions can
serve the purpose.
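
As a rough illustration of how much the choice of unit matters, here is a minimal Python sketch. It uses only the word and character counts from Ramesh's FIFA Laws table quoted further down in this thread. The names `fifa_laws`, `relative_length` and `ranking` are my own, chosen for illustration, and English is taken as the baseline purely by assumption.

```python
# Word and character counts copied from the FIFA Laws table quoted
# later in this thread (Laws97GB/DE/SP/FR.txt).
fifa_laws = {
    "English": {"words": 10216, "chars": 56874},
    "German": {"words": 9173, "chars": 63402},
    "Spanish": {"words": 11030, "chars": 63765},
    "French": {"words": 11763, "chars": 67537},
}


def relative_length(counts, unit, baseline="English"):
    """Divide each language's count by the baseline language's count."""
    base = counts[baseline][unit]
    return {lang: c[unit] / base for lang, c in counts.items()}


def ranking(ratios):
    """Return the languages ordered from shortest to longest."""
    return sorted(ratios, key=ratios.get)


if __name__ == "__main__":
    for unit in ("words", "chars"):
        ratios = relative_length(fifa_laws, unit)
        print(unit, ranking(ratios),
              {lang: round(r, 3) for lang, r in ratios.items()})
```

On these figures the German text comes out shortest when measured in words but longer than the English when measured in characters, so even the two simplest units already disagree about the ordering. Proposition counts are not included because the thread gives no such figures.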

Regards,

Alex


 --- Yorick Wilks <yorick at dcs.shef.ac.uk> wrote:
> Sorry, I don't quite follow this. I thought the original question was
> just about length (whether text, characters, morphemes or words), and I
> didn't know when reading the question what the questioner's purpose
> was. I HOPE it wasn't language discrimination, because Ramesh's figures
> show pretty clearly that length (as words) doesn't separate Slavic
> languages like Czech from Estonian/Hungarian, though length as
> characters does a bit better, although there's no separation from the
> Slavic family as a whole at all!
> None of that seems terribly simplistic, just natural, given the
> question and answer (though unhelpful, as it turns out).
>
> What I don't follow is the link to alignment that you make. Alignment
> is clearly interesting, but what does it or can it say about the
> relative length of languages that the simpler counts do not?
> What is this 'other direction' you write of? Is it that, if you align
> at the sentence level many-one, it says something about some property
> of the languages that can distinguish them?
> Or won't all that depend on the existence and shared significance of
> punctuation marks, which seems a bit implausible?
> Regards
> Yorick Wilks
>
>
>
> Alex Chengyu Fang wrote:
>
> > Which measure to use depends on the purpose of the
> > study, whether to bring out differences or
> > similarities of the languages concerned.
> >
> > A rather simplistic view is that counts of words,
> > characters, syllables, morphemes etc. tend to be used
> > to discriminate between languages. An attempt in the
> > other direction is the use of the number of
> > propositions to, for instance, automatically align
> > multilingual texts:
> >
> > Campbell, J. and A.C. Fang. 1995. Automated Alignment
> > in Multilingual Corpora. In Proceedings of the 10th
> > Pacific Asia Conference on Language, Information and
> > Computation (PACLIC10), 27-28 December 1995, Hong Kong
> > City University, Hong Kong. pp 185-193.
> >
> > Regards,
> >
> > Alex Fang
> >
> >  --- ramesh at ccl.bham.ac.uk wrote:
> > > Dear Yorick
> > >
> > > Would morpheme counts not be even more accurate (or
> > > linguistically valid) than counting orthographic
> > > characters?
> > > Unfortunately, I don't think anyone has done these
> > > yet...
> > >
> > > Anyway, I agree that for the moment, character counts
> > > are a useful addition to word counts.
> > >
> > > Problems about translation (compensation, explication,
> > > zero translation, etc) obviously apply throughout.
> > >
> > > Here are some figures from my own research:
> > >
> > > 1. FIFA Laws in English, German, Spanish, and French.
> > > French is longest, then Spanish, German, and English.
> > >
> > > lines  words  characters   text
> > >
> > >   726  10216       56874   Laws97GB.txt
> > >   724   9173       63402   Laws97DE.txt
> > >  1342  11030       63765   Laws97SP.txt
> > >  1169  11763       67537   Laws97FR.txt
> > >
> > > 2. Canadian Hansard in English and French.
> > > French is longer in both samples.
> > >
> > >    lines   words  chars  text
> > >
> > >     1569   20336  104015 c1.001.E.A
> > >     1569   22413  124457 c1.002.F.A
> > >
> > >     1120   12260   62421 c2.002.E.A
> > >     1120   12135   62622 c2.003.F.A
> > >
> > > 3. George Orwell's 1984 (thanks to Multext-East and TELRI)
> > > in several languages. These figures were provided by
> > > Dr Tomaz Erjavec (Ljubljana) with various additional
> > > caveats:
> > >
> > >             line    word   char
> > >
> > > English    16053  102787  584803
> > > Bulgarian  11172   85878  536977
> > > Czech      11087   79022  498216
> > > Estonian   17872   78792  545984
> > > Hungarian   8813   79814  575219
> > > Romanian   16684  103704  603868
> > > Slovene    14938   91336  541461
> > >
> > > 4. Le Monde Diplomatique in English and French:
> > >
> > >      lines   words  characters   text
> > >
> > >      116     956    6410         LEMAE1.txt
> > >      133     941    7457         LEMAF1.txt
> > >
> > > 5. From research with Dr Maria Cristina Borba (Rio
> > > Grande, Brazil).
> > > Alice in Wonderland in English, 2 Brazilian-Portuguese
> > > translations (one for adults, one for children), and a
> > > Catalan translation (MARIST).
> > >
> > >                             CARROLL    LEITE  SEVCENKO   MARIST
> > >
> > > File length (bytes)         204,288  148,889   150,235  143,055
> > > Running words (tokens)       31,731   25,348    26,245   25,566
> > > Different words (types)       3,417    3,896     3,614    4,400
> > > type/token ratio (mean)      44.99%   51.61%    51.25%   51.19%
> > > ave. word length (letters)     3.63     4.36      4.31     4.16
> > >
> > > Best
> > > Ramesh
> > >
> > > Ramesh Krishnamurthy
> > > Honorary Research Fellow, University of Birmingham;
> > > Honorary Research Fellow, University of Wolverhampton;
> > > Consultant, Cobuild and Bank of English Corpus,
> > > Collins Dictionaries.
> > >
> > >
> > > On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick Wilks wrote:
> > > >
> > > >
> > > > Isn't there some (minor) confusion here? If the
> > > > question really is relative TEXT length,
> > > > then nothing to do with word counts will settle
> > > > it; what matters is character counts, since word
> > > > length varies considerably between languages. The table
=== message truncated ===
