[Corpora-List] News from the LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Mon Oct 27 19:50:27 UTC 2008


-  *Programmer Analyst Position at LDC*  -

LDC2008T22
-  *Czech Academic Corpus 2.0*  -
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22>

LDC2008T19
-  *The New York Times Annotated Corpus*  -
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19>

The Linguistic Data Consortium (LDC) would like to announce a programmer 
analyst opening and the availability of two new publications.

------------------------------------------------------------------------

*Programmer Analyst Position at LDC*

The Linguistic Data Consortium (LDC) at the University of Pennsylvania, 
Philadelphia, PA has an immediate opening for a full-time programmer 
analyst.

Programmer Analyst -- Publications Programmer (#081025790)

Duties: Position will have primary responsibility for developing, 
implementing and managing data processing systems required to coordinate 
and prepare publications of language resources used for human language 
technology research and technology development.  Such resources include 
video, computer-readable speech, software and text data that are 
distributed via media and internet.  Position will communicate with 
external data providers and internal project managers to acquire raw 
source material and to schedule releases; perform quality assessment of 
large data collections and render analyses/descriptions of their 
formats; create or adapt software tools to condition data to a uniform 
format and level of quality (e.g., eliminating corrupted data, 
normalizing data); validate published data against quality control 
standards and verify results; document initial and final data formats; review 
author documentation and supporting materials; create additional 
documentation as needed; and master and replicate publications. Position 
will also maintain the publications catalog system, the publications 
inventory, the archive of publishable and published data and the 
publication equipment, software and licenses.  Position requires 
attention to detail and is responsible for managing multiple short-term 
projects.

For further information on the duties and qualifications for this 
position, or to apply online, please visit http://jobs.hr.upenn.edu/ and 
search postings for the reference number indicated above.

Penn offers an excellent benefits package including medical/dental, 
retirement plans, tuition assistance and a minimum of 3 weeks paid 
vacation per year. The University of Pennsylvania is an affirmative 
action/equal opportunity employer.

Position contingent upon funding.  For more information about LDC and the programs we support, visit http://www.ldc.upenn.edu/.


*New Publications*

(1) The Prague family of annotated corpora has a new member, the Czech 
Academic Corpus 2.0 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22> 
(CAC 2.0). CAC 2.0 consists of 650,000 words from various 1970s and 
1980s newspapers, magazines and radio and television broadcast 
transcripts manually annotated for morphology and syntax. 

The CAC 2.0 offers:

    * For linguists: language material reflecting the real usage of the
      language.
    * For computational linguists: tools and a considerable amount of
      data for natural language applications that are not feasible
      without morphological and syntactical text processing.
    * For TrEd annotation tool users: the possibility to use voice
      control for the tool.
    * For teachers and their students: an interesting didactic tool for
      practicing Czech language morphology and syntax.

CAC 2.0 was created by a team from the Institute of the Czech Language, 
the Academy of Sciences of the Czech Republic.  The original purpose of 
the corpus was to build a frequency dictionary of the Czech language. 
Researchers were aware, however, that in order to make the CAC useful 
for future users, whether linguists or natural language processing 
systems developers, it was necessary to design annotation schemes and to 
develop tools that would add as much linguistic information as possible 
to the data. In 1996, the Prague Dependency Treebank (PDT) 
<http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/whatis.html>, which 
provided morphological and syntactic analytic layers of annotation to 
certain Czech media data, was launched independently of the CAC. During 
the work on the PDT's second version <http://ufal.mff.cuni.cz/pdt2.0/>, 
its researchers decided to transfer PDT's internal format and annotation 
scheme to the CAC with the goals of making the CAC and the PDT fully 
compatible and of integrating the CAC into the PDT. To that end, the CAC 
was manually annotated for morphology and syntax. CAC 2.0 adds the 
surface syntax annotation; in the terminology of the PDT, this 
annotation is called an analytical layer.

A morphological layer of annotation assigns each word token its lemma 
(the canonical form of a lexeme), its part of speech, and its 
morphological categories (case, number, tense, person, etc.). Formally, 
part-of-speech classes combine with the values of morphological 
categories to form morphological tags (or, simply, tags). In the CAC 2.0, 
tags are designed according to the PDT as strings of fixed length (15 
positions) where each position corresponds to a single category. 
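
As a rough illustration of the positional tag format described above, the 
following Python sketch splits a 15-character tag into named categories. 
The position names follow the PDT positional tagset as it is commonly 
documented and are an assumption here, as are the sample tag and the 
convention that "-" marks a category that does not apply; the CAC 2.0 
documentation is the authoritative source for the actual inventory.

```python
# Assumed names for the 15 tag positions (not taken from this announcement).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a fixed-length positional tag to {category: value}.

    Positions holding "-" are treated as not applicable and are omitted.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} positions, got {len(tag)}"
        )
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # Hypothetical sample tag, used only to show the mechanics.
    for category, value in decode_tag("NNMS1-----A----").items():
        print(f"{category}: {value}")
```

Because every position has a fixed meaning, a tag can be decoded by index 
alone, without any lookup of the word form itself.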

In addition to CAC 2.0, the following PDT resources are available from 
LDC: Prague Dependency Treebank 1.0, LDC2001T10 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10>, 
Prague Dependency Treebank 2.0, LDC2006T01 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01>, 
Prague Arabic Dependency Treebank 1.0, LDC2004T23 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T23> 
and Prague Czech-English Dependency Treebank 1.0, LDC2004T25 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T25>.




(2) The New York Times Annotated Corpus 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19> 
contains over 1.8 million articles written and published by the New York 
Times with article metadata provided by the New York Times Newsroom, the 
New York Times Indexing Service and the online production staff at 
nytimes.com. The corpus also provides associated Java software tools for 
parsing corpus documents from .xml into a memory resident object. This 
rich archive will be useful for a number of linguistic-related research 
applications, including the development of automatic document 
summarization systems and automatic content extraction technology.

Highlights of the corpus include:

    * Over 1.8 million articles written and published between January 1,
      1987 and June 19, 2007.
    * Over 650,000 article summaries written by library scientists.
    * Over 1.5 million articles manually tagged by library scientists
      with tags drawn from a normalized indexing vocabulary of people,
      organizations, locations and topic descriptors.
    * Over 275,000 algorithmically-tagged articles that have been hand
      verified by the online production staff at nytimes.com.
    * Java tools for parsing corpus documents from .xml into a memory
      resident object.

The corpus text is formatted in News Industry Text Format (NITF), an XML 
specification that provides a standardized representation for the 
content and structure of discrete news articles. NITF includes 
structural markup such as bylines, headlines and paragraphs. The format 
also provides management attributes for categorizing articles into 
topics, summarization usage restrictions and revision histories.
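
The corpus ships with Java tools for this purpose; purely as an 
illustration of the structural markup mentioned above, the Python sketch 
below pulls the headline, byline and paragraphs out of a small NITF-style 
document using the standard library. The sample document is invented for 
this example, and the element paths used (hedline/hl1, byline, p) are an 
assumption about how a given file is laid out; actual corpus files may 
carry additional elements and attributes.

```python
import xml.etree.ElementTree as ET

# Invented NITF-style sample; not an excerpt from the corpus.
SAMPLE = """<nitf>
  <head><title>Sample Title</title></head>
  <body>
    <body.head>
      <hedline><hl1>Sample Headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Return headline, byline and paragraph texts from NITF-style XML."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns the default when the element is missing.
        "headline": root.findtext(".//hedline/hl1", default="").strip(),
        "byline": root.findtext(".//byline", default="").strip(),
        "paragraphs": [(p.text or "").strip() for p in root.iter("p")],
    }


if __name__ == "__main__":
    article = parse_article(SAMPLE)
    print(article["headline"])
    print(article["byline"])
    for paragraph in article["paragraphs"]:
        print(paragraph)
```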

The New York Times has established a community website for researchers 
working on the data set at http://groups.google.com/group/nytnlp and 
encourages feedback and discussion about the corpus.


------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
