[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Mon Oct 27 19:50:27 UTC 2008
- *Programmer Analyst Position at LDC*
- LDC2008T22: *Czech Academic Corpus 2.0*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22>
- LDC2008T19: *The New York Times Annotated Corpus*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19>
The Linguistic Data Consortium (LDC) would like to announce a programmer
analyst opening and the availability of two new publications.
------------------------------------------------------------------------
*Programmer Analyst Position at LDC*
The Linguistic Data Consortium (LDC) at the University of Pennsylvania,
Philadelphia, PA has an immediate opening for a full-time programmer
analyst.
Programmer Analyst -- Publications Programmer (#081025790)
Duties: Position will have primary responsibility for developing,
implementing and managing data processing systems required to coordinate
and prepare publications of language resources used for human language
technology research and technology development. Such resources include
video, computer-readable speech, software and text data that are
distributed via media and internet. Position will communicate with
external data providers and internal project managers to acquire raw
source material and to schedule releases; perform quality assessment of
large data collections and render analyses/descriptions of their
formats; create or adapt software tools to condition data to a uniform
format and level of quality (e.g., eliminating corrupted data,
normalizing data, etc.); apply quality control standards to published
data and verify results; document initial and final data formats; review
author documentation and supporting materials; create additional
documentation as needed; and master and replicate publications. Position
will also maintain the publications catalog system, the publications
inventory, the archive of publishable and published data and the
publication equipment, software and licenses. Position requires
attention to detail and is responsible for managing multiple short-term
projects.
For further information on the duties and qualifications for this
position, or to apply online, please visit http://jobs.hr.upenn.edu/
and search the postings for the reference number indicated above.
Penn offers an excellent benefits package including medical/dental,
retirement plans, tuition assistance and a minimum of 3 weeks paid
vacation per year. The University of Pennsylvania is an affirmative
action/equal opportunity employer.
Position contingent upon funding.
For more information about LDC and the programs we support, visit
http://www.ldc.upenn.edu/.
*New Publications*
(1) The Prague family of annotated corpora has a new member, the Czech
Academic Corpus 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22>
(CAC 2.0). CAC 2.0 consists of 650,000 words drawn from various 1970s
and 1980s newspapers, magazines, and radio and television broadcast
transcripts, all manually annotated for morphology and syntax.
The CAC 2.0 offers:
* For linguists: language material reflecting the real usage of the
language.
* For computational linguists: tools and a considerable amount of
data for natural language applications that are not feasible
without morphological and syntactical text processing.
* For TrEd annotation tool users: the possibility to use voice
control for the tool.
* For teachers and their students: an interesting didactic tool for
practicing Czech language morphology and syntax.
CAC 2.0 was created by a team from the Institute of the Czech Language,
the Academy of Sciences of the Czech Republic. The original purpose of
the corpus was to build a frequency dictionary of the Czech language.
Researchers were aware, however, that in order to make the CAC useful
for future users, whether linguists or natural language processing
systems developers, it was necessary to design annotation schemes and to
develop tools that would add as much linguistic information as possible
to the data. In 1996, the Prague Dependency Treebank (PDT)
<http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/whatis.html>, which
provided morphological and syntactic analytic layers of annotation to
certain Czech media data, was launched independently of the CAC. During
the work on the PDT's second version <http://ufal.mff.cuni.cz/pdt2.0/>,
its researchers decided to transfer PDT's internal format and annotation
scheme to the CAC with the goals of making the CAC and the PDT fully
compatible and of integrating the CAC into the PDT. To that end, the CAC
was manually annotated for morphology and syntax. CAC 2.0 adds the
surface syntax annotation; in the terminology of the PDT, this
annotation is called an analytical layer.
The morphological layer of annotation supplies each word token with its
lemma (the canonical form of a lexeme), its part of speech, and the
values of its morphological categories (case, number, tense, person,
etc.). Formally, a part-of-speech class combined with the values of the
morphological categories forms a morphological tag (or, simply, a tag).
In CAC 2.0, tags follow the PDT design: they are strings of fixed length
(15 positions) in which each position corresponds to a single category.
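The fixed-length tag design described above can be illustrated with a short Python sketch. The position names below are an assumption taken from the PDT positional tagset documentation, not from this announcement, and the sample tag and the helper name decode_tag are hypothetical; consult the CAC 2.0 documentation for the authoritative inventory of positions and values.

```python
# Names of the 15 tag positions. These are an ASSUMPTION based on the PDT
# positional tagset documentation; they are not listed in this announcement.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a 15-character positional tag to its non-empty categories.

    decode_tag is a hypothetical helper written for this illustration.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(
            "expected %d positions, got %d" % (len(POSITIONS), len(tag))
        )
    # "-" is assumed to mark a category that does not apply to the token.
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


# Hypothetical tag for a masculine singular nominative noun.
print(decode_tag("NNMS1-----A----"))
```

Because every category sits at a fixed offset, a consumer can also read a single category directly, for example the case value at index 4 of the tag string, without decoding the rest.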
In addition to CAC 2.0, the following PDT resources are available from
LDC: Prague Dependency Treebank 1.0, LDC2001T10
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10>,
Prague Dependency Treebank 2.0, LDC2006T01
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01>,
Prague Arabic Dependency Treebank 1.0, LDC2004T23
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T23>
and Prague Czech-English Dependency Treebank 1.0, LDC2004T25
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T25>
(2) The New York Times Annotated Corpus
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19>
contains over 1.8 million articles written and published by the New York
Times with article metadata provided by the New York Times Newsroom, the
New York Times Indexing Service and the online production staff at
nytimes.com. The corpus also provides associated Java software tools for
parsing corpus documents from .xml into a memory-resident object. This
rich archive will be useful for a number of linguistic-related research
applications, including the development of automatic document
summarization systems and automatic content extraction technology.
Highlights of the corpus include:
* Over 1.8 million articles written and published between January 1,
1987 and June 19, 2007.
* Over 650,000 article summaries written by library scientists.
* Over 1.5 million articles manually tagged by library scientists
with terms drawn from a normalized indexing vocabulary of people,
organizations, locations and topic descriptors.
* Over 275,000 algorithmically-tagged articles that have been hand
verified by the online production staff at nytimes.com.
* Java tools for parsing corpus documents from .xml into a memory
resident object.
The corpus text is formatted in News Industry Text Format (NITF), an XML
specification that provides a standardized representation for the
content and structure of discrete news articles. NITF includes
structural markup such as bylines, headlines and paragraphs. The format
also provides management attributes for categorizing articles into
topics, summarization usage restrictions and revision histories.
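As a rough illustration of the structural markup mentioned above, the following Python sketch pulls the headline, byline and paragraphs out of a tiny NITF-style document using only the standard library. The sample document is invented for this example, the element names (hedline, hl1, byline, body.content, p) are assumed from the public NITF specification, and parse_article is a hypothetical helper. It is not the Java tooling shipped with the corpus, and actual corpus files may be organized differently.

```python
import xml.etree.ElementTree as ET

# Invented sample; element names are ASSUMED from the public NITF
# specification and may not match the corpus files exactly.
SAMPLE = """<nitf>
  <head><title>Sample headline</title></head>
  <body>
    <body.head>
      <hedline><hl1>Sample headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Return headline, byline and paragraph texts from an NITF-style string.

    parse_article is a hypothetical helper written for this illustration.
    """
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext(".//hedline/hl1"),
        "byline": root.findtext(".//byline"),
        # Collect the text of every <p> element, in document order.
        "paragraphs": [p.text or "" for p in root.iter("p")],
    }


article = parse_article(SAMPLE)
print(article["headline"])
print(len(article["paragraphs"]))
```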
The New York Times has established a community website for researchers
working on the data set at http://groups.google.com/group/nytnlp and
encourages feedback and discussion about the corpus.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu