[Corpora-List] Ratio of ambiguous tokens in Swedish, Danish and Norwegian

Joakim Nivre nivre at msi.vxu.se
Thu Feb 15 13:31:40 UTC 2007


Hi Hrafn,

You can find some statistics about Swedish in our article:

Nivre, J. and Grönqvist, L. (2001) Tagging a Corpus of Spoken Swedish. 
International Journal of Corpus Linguistics 6(1), 47-78.

A pre-print is available from my home page at:
http://w3.msi.vxu.se/~nivre/research/publ.html

The percentage of ambiguous tokens we get for the Stockholm-Umeå corpus is 
45.37%. However, this is measured with the base tag set, consisting of only 
23 tags. With the full tag set, containing some 150 tags, the percentage 
will be higher. This is one of the reasons why it is very difficult to 
compare these figures across languages and corpora. You will find more 
details in the paper. (The first place to look is Table 1.)
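
To illustrate why the tag set size matters, here is a minimal sketch in
Python. It is not the procedure used in the paper. It assumes that a token
counts as ambiguous when its word form occurs with more than one tag in the
corpus itself, and it uses made-up tags of the form BASE.FEATURE. The names
ambiguity_ratio and base_tag are hypothetical.

```python
from collections import defaultdict


def ambiguity_ratio(tagged_tokens, map_tag=lambda tag: tag):
    """Return the percentage of tokens whose word form has more than one tag.

    Assumption: the set of possible tags for a word form is collected from
    the corpus itself, not from an external lexicon or morphological analyzer.
    map_tag optionally projects each tag onto a coarser tag set.
    """
    tokens = [(word.lower(), map_tag(tag)) for word, tag in tagged_tokens]
    if not tokens:
        return 0.0

    # Collect every tag observed for each word form.
    tags_per_word = defaultdict(set)
    for word, tag in tokens:
        tags_per_word[word].add(tag)

    # A token is ambiguous if its word form was seen with two or more tags.
    ambiguous = sum(1 for word, _ in tokens if len(tags_per_word[word]) > 1)
    return 100.0 * ambiguous / len(tokens)


def base_tag(tag):
    """Project an illustrative BASE.FEATURE tag onto its base part."""
    return tag.split(".")[0]


# Toy data; the tags are illustrative and not the actual corpus tag set.
corpus = [
    ("de", "DT.PL"), ("de", "PN.PL"),    # ambiguous with both tag sets
    ("hus", "NN.SG"), ("hus", "NN.PL"),  # ambiguous only with the full tags
    ("stora", "JJ"), ("är", "VB"),       # unambiguous
]

full_ratio = ambiguity_ratio(corpus)            # full (fine-grained) tags
base_ratio = ambiguity_ratio(corpus, base_tag)  # base (coarse) tags
print(f"full tag set: {full_ratio:.2f}%  base tag set: {base_ratio:.2f}%")
```

In this toy example the word form "hus" is ambiguous only between two
fine-grained tags, so it stops counting as ambiguous once the tags are
projected onto the base set, and the ratio drops accordingly.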

Best,
Joakim

On Thu, 15 Feb 2007, Hrafn Loftsson wrote:

> Hi everyone,
>
> (It has been pointed out to me that, for some reason, my message to the
> list appeared empty in some e-mail systems.  Here is a second try:)
>
> The paper: "J. Hajic (2000) Morphological tagging: Data vs.
> Dictionaries", reports percentages of ambiguous tokens for English
> (38.65%), Czech (45.97%), Estonian (40.24%), Hungarian (21.58%),
> Romanian (40.00%) and Slovene (38.01%), using an annotated version of
> Orwell's 1984 novel for each of these languages.
>
> I need the corresponding percentages for Swedish, Danish and
> Norwegian, calculated using ANY corpora.
>
> Does anyone have this info (and preferably a reference to a paper which
> discusses the issue)?
>
> Regards,
> Hrafn Loftsson
> Assistant professor
> Department of Computer Science
> School of Science and Engineering
> Reykjavik University
> Iceland

==================================================================
Joakim Nivre

Växjö University		Uppsala University
School of Mathematics		Department of Linguistics
and Systems Engineering		and Philology
SE-35195 Växjö			Box 635, SE-75126 Uppsala

Tel: +46 470 708992		Tel: +46 18 4717009
Fax: +46 470 84004		Fax: +46 18 4711094
E-mail: nivre at msi.vxu.se	E-mail: joakim.nivre at lingfil.uu.se

URL: http://www.msi.vxu.se/users/nivre
==================================================================

