[Corpora-List] Most frequent 5K words in Icelandic?

Anton Karl Ingason ingason at ling.upenn.edu
Mon Nov 19 16:04:30 UTC 2012


Hi Kim,

You can use the IcePaHC corpus to extract these frequencies. Although it is
a historical corpus, it spans the period 12th-21st century, so you could
use the texts from, say, the 19th-21st centuries, which represent the
modern language well. IcePaHC is a free resource.

Note that the corpus is lemmatized and in addition to the treebank format,
the main download includes formats which are more convenient for your
purpose.

http://www.linguist.is/icelandic_treebank/Download
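
For the counting step itself, something along these lines might do. It is only a rough sketch: it assumes you have exported the texts you want as tab-separated lines of word, tag and lemma, which is a hypothetical layout and may not match the files in the download, and the function name `top_lemmas` is made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_lemmas(lines: Iterable[str], n: int = 5000) -> List[Tuple[str, int]]:
    """Return the n most frequent lemmas from 'word<TAB>tag<TAB>lemma' lines.

    The three-column layout is an assumption for this sketch, not a
    documented IcePaHC format.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2]:
            continue  # skip blank lines and rows without a lemma column
        counts[fields[2].lower()] += 1  # count by lowercased lemma
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample rows; real data would be read from a file.
    sample = [
        "Hann\tPRO-N\thann\n",
        "sagði\tVBDI\tsegja\n",
        "\n",
        "hann\tPRO-N\thann\n",
    ]
    for lemma, freq in top_lemmas(sample, n=2):
        print(lemma, freq)
```

To stay close to the modern language, you would feed it only the lines from the 19th-21st century texts. Counting lemmas rather than word forms keeps the many inflected forms of one Icelandic word from taking up separate slots in the list.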

Unfortunately, it does not have English glosses, and I don't have any ideal
solution for that, but you might get something useful by looking words up
in this list:
http://linguist.is/dictionary
(it uses a different tagset, and is quite limited, but it is also a free
resource)

The two tagsets you would be interested in are described in these pages:
http://www.linguist.is/icelandic_treebank/Tagset
http://linguist.is/icelandic_treebank/IFD_Tagset

There is an LREC paper on IcePaHC:
http://www.lrec-conf.org/proceedings/lrec2012/summaries/440.html

If you have any questions regarding IcePaHC, feel free to email me or any
other member of the IcePaHC project.

Best,
Anton


On Mon, Nov 19, 2012 at 9:05 AM, Thommy Mayer <thommy.mayer at gmail.com> wrote:

> Hi Kim,
>
> You could also check the "Frequency Dictionary Icelandic" from the
> Leipzig Wortschatz group or contact Uwe Quasthoff directly for the
> relevant data (quasthoff at informatik.uni-leipzig.de).
>
> Quasthoff, Uwe, Sabine Fiedler, Erla Hallsteinsdóttir (ed.). 2012.
> Frequency Dictionary Icelandic (Íslensk tíðniorðabók). Band 3 der
> Reihe Frequency Dictionaries. Universitätsverlag, 109 S. (+CD-ROM).
>
> Regards,
> Thomas
>
> ---------------------------------------------------------------------------
> Thomas Mayer
> Research Unit "Quantitative Language Comparison"
> Forschungszentrum Deutscher Sprachatlas
> Philipps-Universität Marburg
> Hermann-Jacobsohn-Weg 3
> 35032 Marburg
>
> Current address:
> Geschwister Scholl Platz 1
> 80539 München, Germany
> Office: Schellingstraße 9, Raum 301
> Tel: +49 89 2180 6144
> ---------------------------------------------------------------------------
>
>
> 2012/11/19 Kim Witten <kimwitten at gmail.com>:
> > Hi Corpora Subscribers,
> > I'm wondering if somebody might be able to point me in the direction to
> > find a simple list of the 5,000 most frequent words in Icelandic, from any
> > (relatively current, non-historical) Icelandic corpus? With English gloss
> > would be even better, but it's not necessary. Thanks!
> > -Kim
> > ---
> > Kim Witten, PhD candidate
> > Language & Linguistic Science
> > University of York, UK
> > kaw522 at york.ac.uk
> > www.MePhiD.com
> >
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
>



-- 
www.linguist.is
tel: 215-350-7215