Corpora: Counting semantic propositions (was Relative text length)
Tadeusz Piotrowski
tadpiotr at plusnet.pl
Mon Apr 29 20:11:23 UTC 2002
I know some people love semantic propositions etc., but for me we are
back again in the world of Platonic ideas. I like this discussion group
because language is usually not regarded here as an ideal object. I must
confess I find counting (calculating) ideal objects like semantic
propositions a bit difficult. I find it difficult both as a researcher
and as a practising translator, and I reach for my Quine to find peace
of mind.
Regards
Tadeusz Piotrowski
> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Alex Chengyu Fang
> Sent: Monday, April 29, 2002 5:34 PM
> To: Yorick Wilks
> Cc: ramesh at ccl.bham.ac.uk; corpora at hd.uib.no
> Subject: Re: Corpora: Relative text length
>
>
> What I wanted to say is that there are different ways
> of measuring the relative length and that, if counts
> of characters, syllables and morphemes are used, you
> are likely to see differences between language pairs.
> If, however, the semantic proposition is used as the key,
> languages may not be so different, as the number of
> propositions should be a near constant across
> multi-lingual texts that are mutual translations of
> each other.
>
> So, my simplistic view is that to see the differences,
> use characters, syllables and morphemes as
> measurements. To see similarities (the other
> direction), the number of semantic propositions can
> serve the purpose.
>
> Regards,
>
> Alex
>
>
> --- Yorick Wilks <yorick at dcs.shef.ac.uk> wrote:
> > Sorry, I don't quite follow this--I thought the original question was
> > just about length (whether text, characters, morphemes or words) and I
> > didn't know when reading the question what the questioner's purpose
> > was---I HOPE it wasn't language discrimination, because Ramesh's
> > figures show pretty clearly that length (as words) doesn't separate
> > Slavic languages like Czech from Estonian/Hungarian--though length as
> > characters does a bit better, although there's no separation from the
> > Slavic family as a whole at all!
> > None of that seems terribly simplistic, just natural, given the
> > question and answer (though which is unhelpful as it turns out).
> >
> > What I don't follow is the link to alignment that you make--alignment
> > is clearly interesting, but what does it or can it say about the
> > relative length of languages that the simpler counts do not?
> > What is this 'other direction' you write of---is it that, if you
> > align at the sentence level many-one, it says something about some
> > property of the languages that can distinguish them?
> > Or won't all that depend on the existence and shared significance of
> > punctuation marks--which seems a bit implausible?
> > Regards
> > Yorick Wilks
> >
> >
> >
> > Alex Chengyu Fang wrote:
> >
> > > Which measure to use depends on the purpose of the
> > > study, whether to bring out differences or
> > > similarities of the languages concerned.
> > >
> > > A rather simplistic view is that counts of words, characters,
> > > syllables, morphemes etc. tend to be used to discriminate between
> > > languages. An attempt in the other direction is the use of the
> > > number of propositions to, for instance, automatically align
> > > multilingual texts:
> > >
> > > Campbell, J. and A.C. Fang. 1995. Automated Alignment in
> > > Multilingual Corpora. In Proceedings of the 10th Pacific Asia
> > > Conference on Language, Information and Computation (PACLIC10),
> > > 27-28 December 1995, Hong Kong City University, Hong Kong.
> > > pp 185-193.
> > >
> > > Regards,
> > >
> > > Alex Fang
> > >
> > > --- ramesh at ccl.bham.ac.uk wrote:
> > > > Dear Yorick
> > > >
> > > > Would morpheme counts not be even more accurate (or
> > > > linguistically valid) than counting orthographic characters?
> > > > Unfortunately, I don't think anyone has done these yet...
> > > >
> > > > Anyway, I agree that for the moment, character counts
> > > > are a useful addition to word counts.
> > > >
> > > > Problems about translation (compensation, explication,
> > > > zero translation, etc) obviously apply throughout.
> > > >
> > > > Here are some figures from my own research:
> > > >
> > > > 1. FIFA Laws in English, German, Spanish, and French.
> > > > French is longest, then Spanish, German, and English.
> > > >
> > > > lines  words  characters  text
> > > >
> > > >   726  10216       56874  Laws97GB.txt
> > > >   724   9173       63402  Laws97DE.txt
> > > >  1342  11030       63765  Laws97SP.txt
> > > >  1169  11763       67537  Laws97FR.txt
> > > >
> > > > 2. Canadian Hansard in English and French.
> > > > French is longer in both samples.
> > > >
> > > > lines  words   chars  text
> > > >
> > > >  1569  20336  104015  c1.001.E.A
> > > >  1569  22413  124457  c1.002.F.A
> > > >
> > > >  1120  12260   62421  c2.002.E.A
> > > >  1120  12135   62622  c2.003.F.A
> > > >
> > > > 3. George Orwell's 1984 (thanks to Multext-East and TELRI)
> > > > in several languages. These figures were provided by
> > > > Dr Tomaz Erjavec (Ljubljana) with various additional caveats:
> > > >
> > > >             line    word    char
> > > >
> > > > English    16053  102787  584803
> > > > Bulgarian  11172   85878  536977
> > > > Czech      11087   79022  498216
> > > > Estonian   17872   78792  545984
> > > > Hungarian   8813   79814  575219
> > > > Romanian   16684  103704  603868
> > > > Slovene    14938   91336  541461
> > > >
> > > > 4. Le Monde Diplomatique in English and French:
> > > >
> > > > lines  words  characters  text
> > > >
> > > >   116    956        6410  LEMAE1.txt
> > > >   133    941        7457  LEMAF1.txt
> > > >
> > > > 5. From research with Dr Maria Cristina Borba (Rio Grande, Brazil).
> > > > Alice in Wonderland in English, 2 Brazilian-Portuguese translations
> > > > (one for adults, one for children), and a Catalan translation
> > > > (MARIST).
> > > >
> > > >                             CARROLL    LEITE  SEVCENKO   MARIST
> > > >
> > > > File length (bytes)         204,288  148,889   150,235  143,055
> > > > Running words (tokens)       31,731   25,348    26,245   25,566
> > > > Different words (types)       3,417    3,896     3,614    4,400
> > > > type/token ratio (mean)      44.99%   51.61%    51.25%   51.19%
> > > > ave. word length (letters)     3.63     4.36      4.31     4.16
> > > >
> > > > Best
> > > > Ramesh
> > > >
> > > > Ramesh Krishnamurthy
> > > > Honorary Research Fellow, University of Birmingham;
> > > > Honorary Research Fellow, University of Wolverhampton;
> > > > Consultant, Cobuild and Bank of English Corpus,
> > > > Collins Dictionaries.
> > > >
> > > >
> > > > On Thu, Apr 25, 2002 at 04:56:15PM +0100, Yorick Wilks wrote:
> > > > >
> > > > > Isn't there some (minor) confusion here? If the question really
> > > > > is relative TEXT length, then nothing to do with word counts will
> > > > > settle it--what matters is character counts, since word length
> > > > > varies considerably between languages. The table
> >
> === message truncated ===
>
>
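
For anyone who wants to check the word-versus-character point against the figures quoted in the thread, here is a minimal Python sketch. It only re-uses the word and character counts for Orwell's 1984 as quoted above; the names COUNTS and relative_lengths are made up for illustration, the choice of English as the baseline is an assumption, and the characters-per-word figure is only a rough proxy, since the "char" column may well include spaces and punctuation.

```python
# Word and character counts for Orwell's 1984, copied from the figures
# quoted in the thread (attributed there to Dr Tomaz Erjavec).
COUNTS = {
    "English": (102787, 584803),
    "Bulgarian": (85878, 536977),
    "Czech": (79022, 498216),
    "Estonian": (78792, 545984),
    "Hungarian": (79814, 575219),
    "Romanian": (103704, 603868),
    "Slovene": (91336, 541461),
}


def relative_lengths(counts, baseline="English"):
    """Return {language: (word ratio, char ratio, chars per word)}.

    Ratios are relative to the baseline language (an assumption made
    here for illustration; any language in the table would do).
    """
    base_words, base_chars = counts[baseline]
    result = {}
    for language, (words, chars) in counts.items():
        result[language] = (
            words / base_words,   # length in words relative to baseline
            chars / base_chars,   # length in characters relative to baseline
            chars / words,        # rough characters per word
        )
    return result


if __name__ == "__main__":
    for language, (w, c, cpw) in relative_lengths(COUNTS).items():
        print(f"{language:<10} words x{w:.2f}  chars x{c:.2f}  chars/word {cpw:.2f}")
```

On these figures the word ratio and the character ratio for the same language can differ noticeably, which is the reason the choice of unit matters in the discussion above.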