22.3413, FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank

linguist at LINGUISTLIST.ORG
Tue Aug 30 14:30:19 UTC 2011


LINGUIST List: Vol-22-3413. Tue Aug 30 2011. ISSN: 1068 - 4875.

Subject: 22.3413, FYI: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Veronika Drake, U of Wisconsin-Madison  
Monica Macaulay, U of Wisconsin-Madison  
Rajiv Rao, U of Wisconsin-Madison  
Joseph Salmons, U of Wisconsin-Madison  
Anja Wanner, U of Wisconsin-Madison  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Brent Miller <brent at linguistlist.org>
================================================================  
Visit LL's Multitree project for over 1000 trees dynamically generated
from scholarly hypotheses about language relationships:
          http://multitree.linguistlist.org/
					
					
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.cfm.

===========================Directory==============================  

1)
Date: 29-Aug-2011
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank
 

	
-------------------------Message 1 ---------------------------------- 
Date: Tue, 30 Aug 2011 10:29:41
From: Joel Wallenberg [joel.wallenberg at gmail.com]
Subject: IcePaHC 0.9.: 1 Million Words, Icelandic Treebank


We are very pleased to announce that version 0.9 of the Icelandic 
Parsed Historical Corpus (IcePaHC) is now available for free download. 

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

The corpus is a treebank of over 1 million words, annotated with a 
full phrase structure parse and hand-corrected, using an adaptation of 
the annotation scheme used by the Penn Treebank and the Penn 
Parsed Corpora of Historical English 
(http://www.ling.upenn.edu/hist-corpora/). Note that this release 
contains all of the text for version 1.0, but some minor corrections 
remain to be finished.

The corpus contains:

- 1 002 361 words total, consisting of ~100 000-word samples from 
each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and 
lemmatized.
- The entire parse, pos-tagging, and lemmata for every sentence have 
been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the 
corpus for research and/or profit with appropriate citation.

The corpus is distributed as raw UTF-8 data in labeled bracketing 
format and it is therefore compatible with various existing programs, 
including CorpusSearch (http://corpussearch.sourceforge.net/).  
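
As an illustration of why labeled bracketing is easy to process with 
other tools, here is a minimal Python sketch that reads one bracketed 
tree into nested lists and pulls out its (tag, word) pairs. The function 
names parse_bracketing and leaves are hypothetical, and the example 
tree and its labels are invented for illustration; they are not taken 
from the corpus files, whose exact layout should be checked against 
the annotation guidelines.

```python
import re

# A token is an opening bracket, a closing bracket, or any run of
# characters that contains neither whitespace nor brackets.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketing(text):
    """Parse one labeled-bracketing tree into nested lists.

    Each node becomes [label, child, child, ...]; a terminal node is
    [tag, word]. Assumes the brackets in `text` are balanced.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return items

    return node()


def leaves(tree):
    """Yield (tag, word) for every terminal node, left to right."""
    if len(tree) == 2 and all(isinstance(item, str) for item in tree):
        yield (tree[0], tree[1])
        return
    for child in tree:
        if isinstance(child, list):
            yield from leaves(child)


# Invented example tree (illustrative labels, not corpus data).
example = "(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI kom) (ADVP (ADV heim)))"
tree = parse_bracketing(example)
print(list(leaves(tree)))
```

The same nested-list structure could be walked to count phrase labels 
or to search for particular configurations, although a dedicated tool 
such as CorpusSearch is likely the better choice for real queries.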

A plain text version without markup and a set of info files containing 
philological information accompany the corpus download.

The entire corpus may be downloaded as plain text, or packaged with 
either a platform-independent GUI or a Windows-compatible GUI for 
ease of searching.

Further information on the annotation guidelines and project 
organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/


Joel C. Wallenberg (joel.wallenberg at gmail.com)
Anton Karl Ingason (anton.karl.ingason at gmail.com)
Einar Freyr Sigurðsson (einarfs at gmail.com)
Eiríkur Rögnvaldsson (eirikur at hi.is)
University of Iceland

We were grateful to receive support for this project through the 
following grants:

Icelandic Research Fund (RANNÍS), grant nr. 090662011, ''Viable 
Language Technology beyond English - Icelandic as a test case''.

U.S. National Science Foundation (NSF) International Research 
Fellowship Program (IRFP), grant #OISE-0853114, ''Evolution of 
Language Systems: a comparative study of grammatical change in 
Icelandic and English''.

University of Iceland Research Fund (Rannsóknasjóður Háskóla 
Íslands), grant ''Icelandic Diachronic Treebank'' (Sögulegur íslenskur 
trjábanki).



Linguistic Field(s): Computational Linguistics
                     Historical Linguistics
                     Syntax
                     Text/Corpus Linguistics

