<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<div align="center">- <b>Programmer Analyst Position at LDC</b> -<br>

<br>

LDC2008T22 <br>

-  <b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22">Czech

Academic Corpus 2.0</a> </b> - <br>

<br>

 LDC2008T19 <br>

-  <b><a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19">The

New York Times Annotated Corpus</a> </b> -<br>

<br>

The Linguistic Data Consortium (LDC) would like to announce a

programmer analyst opening and the availability of two new publications.<br>

</div>

<br>

<hr size="2" width="100%">


<div align="center"><b>Programmer Analyst Position at LDC<br>

</b>

<p class="MsoNormal" align="left">The Linguistic Data Consortium (LDC)
at the University of Pennsylvania, Philadelphia, PA
has an immediate opening for a full-time
programmer analyst.</p>

<p class="MsoNormal" align="left">Programmer Analyst – Publications

Programmer (#081025790)</p>

<p class="MsoBodyText" style="line-height: normal;" align="left"><span
 style="font-size: 12pt; font-weight: normal;">Duties: Position will
have primary responsibility for developing, implementing and managing
the data processing systems required to coordinate and prepare
publications of language resources used for human language technology
research and technology development. Such resources include video,
computer-readable speech, software and text data that are distributed
via media and the internet. Position will communicate with external
data providers and internal project managers to acquire raw source
material and to schedule releases; perform quality assessment of large
data collections and render analyses/descriptions of their formats;
create or adapt software tools to condition data to a uniform format
and level of quality (e.g., eliminating corrupted data, normalizing
data); apply quality control standards to published data and verify
results; document initial and final data formats; review author
documentation and supporting materials; create additional documentation
as needed; and master and replicate publications. Position will also
maintain the publications catalog system, the publications inventory,
the archive of publishable and published data and the publication
equipment, software and licenses. Position requires attention to detail
and is responsible for managing multiple short-term projects.</span></p>

<div align="left">

<p><span style="font-size: 12pt; font-family: 'Times New Roman';">For
further information on the duties and qualifications for this position,
or to apply online, please visit <a href="http://jobs.hr.upenn.edu/">http://jobs.hr.upenn.edu/</a>
and search postings for the reference number indicated above.</span></p>

<p><span style="font-size: 12pt; font-family: 'Times New Roman';">Penn
offers an excellent benefits package including medical/dental,
retirement plans, tuition assistance and a minimum of three weeks of
paid vacation per year. The University of Pennsylvania is an
affirmative action/equal opportunity employer.</span></p>

<p><span style="font-size: 12pt; font-family: 'Times New Roman';">Position
contingent upon funding. For more information about LDC and the
programs we support, visit <a
 href="http://www.ldc.upenn.edu/">http://www.ldc.upenn.edu/</a>.</span></p>

</div>


<br>

</div>


<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
 align="center"><b>New Publications</b></p>

<p>(1) The Prague family of annotated corpora has a new member, the <a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T22">Czech
Academic Corpus 2.0</a> (CAC 2.0). CAC 2.0 consists of 650,000 words
drawn from various 1970s and 1980s newspapers, magazines, and radio and
television broadcast transcripts, manually annotated for morphology and
syntax.</p>

<p>CAC 2.0 offers:</p>

<ul type="disc">

  <li class="MsoNormal">For linguists: language material reflecting
the real usage of the language.</li>
  <li class="MsoNormal">For computational linguists: tools and a
considerable amount of data for natural language applications that are
not feasible without morphological and syntactic text processing.</li>
  <li class="MsoNormal">For TrEd annotation tool users: the option of
controlling the tool by voice.</li>
  <li class="MsoNormal">For teachers and their students: an interesting
didactic tool for practicing Czech language morphology and syntax.</li>

</ul>

<p>CAC 2.0 was created by a team from the Institute of the Czech
Language, the Academy of Sciences of the Czech Republic.

The original

purpose of the corpus was to build a frequency dictionary of the Czech

language. Researchers were aware, however, that in order to make the

CAC useful

for future users, whether linguists or natural language processing

systems

developers, it was necessary to design annotation schemes and to

develop tools

that would add as much linguistic information as possible to the data.

In 1996,

<a href="http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/whatis.html">the

Prague Dependency Treebank (PDT)</a>, which provided morphological and

syntactic analytic layers of annotation to certain Czech media data,

was

launched independently of the CAC. During the work on the <a

 href="http://ufal.mff.cuni.cz/pdt2.0/">PDT's second version</a>, its

researchers decided to transfer PDT's internal format and annotation

scheme to

the CAC with the goals of making the CAC and the PDT fully compatible

and of

integrating the CAC into the PDT. To that end, the CAC was manually

annotated

for morphology and syntax. CAC 2.0 adds the surface syntax annotation;

in the

terminology of the PDT, this annotation is called an analytical layer.</p>

<p>A morphological layer of annotation gives each word token additional
data describing its morphological properties: the lemma (the canonical
form of a lexeme), the part of speech and morphological categories
(case, number, tense, person, etc.). Formally, part-of-speech classes
combine with the values of morphological categories to form
morphological tags (or, simply, tags). In CAC 2.0, tags follow the PDT
design: they are fixed-length strings of 15 positions, where each
position corresponds to a single category.</p>
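<p>The positional tag design described above can be sketched in a few
lines of Python. This is a minimal illustration and not part of the CAC
2.0 distribution: the category names in <code>POSITION_NAMES</code>, the
use of "-" for a position that does not apply, and the sample tag are
assumptions based on the PDT positional tag convention. The corpus
documentation remains the authoritative source for the tag inventory.</p>

<pre>
```python
# Assumed names for the 15 tag positions (PDT-style convention);
# verify them against the corpus documentation before relying on them.
POSITION_NAMES = [
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "variant",
]
TAG_LENGTH = len(POSITION_NAMES)  # 15 positions, one category each


def decode_tag(tag):
    """Map each position of a fixed-length tag to its category name.

    Positions holding "-" (assumed to mean "not applicable") are skipped.
    """
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITION_NAMES, tag) if value != "-"}


if __name__ == "__main__":
    # Hypothetical tag for a feminine singular nominative noun.
    print(decode_tag("NNFS1-----A----"))
```
</pre>

<p>Because every category has a fixed position, a tag can be decoded by
index alone, without any separator characters or lookup of neighboring
tokens.</p>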

<p>In addition to CAC 2.0, the following PDT resources are available

from LDC: <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10">Prague

Dependency Treebank 1.0, LDC2001T10</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01">Prague

Dependency Treebank 2.0, LDC2006T01</a>, <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T23">Prague

Arabic Dependency Treebank 1.0, LDC2004T23</a> and <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T25">Prague

Czech-English Dependency Treebank 1.0, LDC2004T25</a>.</p>

<p class="MsoNormal" style="text-align: center;" align="center"><b>*</b></p>

<p class="MsoNormal">(2)
<a
 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19">The
New York Times Annotated Corpus</a> contains over 1.8 million articles
written and published by the New York Times, with article metadata
provided by the New York Times Newsroom, the New York Times Indexing
Service and the online production staff at nytimes.com. The corpus also
provides associated Java software tools for parsing corpus documents
from .xml into a memory-resident object. This rich archive will be
useful for a number of linguistics-related research applications,
including the development of automatic document summarization systems
and automatic content extraction technology.</p>

<p class="MsoNormal">Highlights
of the corpus include:</p>

<ul type="disc">

  <li class="MsoNormal">Over 1.8 million articles written and
published between January 1, 1987 and June 19, 2007.</li>
  <li class="MsoNormal">Over 650,000 article summaries written by
library scientists.</li>
  <li class="MsoNormal">Over 1.5 million articles manually tagged by
library scientists with tags drawn from a normalized indexing
vocabulary of people, organizations, locations and topic descriptors.</li>
  <li class="MsoNormal">Over 275,000 algorithmically tagged articles
that have been hand-verified by the online production staff at
nytimes.com.</li>
  <li class="MsoNormal">Java tools for parsing corpus documents from
.xml into a memory-resident object.</li>

</ul>

<p class="MsoNormal">The
corpus text is formatted in News Industry Text Format (NITF), an XML
specification that provides a standardized representation for the
content and structure of discrete news articles. NITF includes
structural markup such as bylines, headlines and paragraphs. The format
also provides management attributes for categorizing articles into
topics, summarization usage restrictions and revision histories.</p>
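<p>To make the structural markup concrete, the following Python sketch
reads a headline, a byline and paragraphs from an NITF-style document
into a plain in-memory object. It is an illustration only and is not
the Java tooling shipped with the corpus. The element names
(<code>hedline</code>, <code>hl1</code>, <code>byline</code>,
<code>body.head</code>, <code>body.content</code>, <code>p</code>) are
assumed from the general NITF specification, and the sample article is
invented; actual corpus files may be organized differently.</p>

<pre>
```python
import xml.etree.ElementTree as ET


def build_sample_article():
    """Build a tiny NITF-like document in memory (invented content)."""
    nitf = ET.Element("nitf")
    body = ET.SubElement(nitf, "body")
    body_head = ET.SubElement(body, "body.head")
    hedline = ET.SubElement(body_head, "hedline")
    ET.SubElement(hedline, "hl1").text = "Sample Headline"
    ET.SubElement(body_head, "byline").text = "By A. Reporter"
    content = ET.SubElement(body, "body.content")
    for text in ("First paragraph.", "Second paragraph."):
        ET.SubElement(content, "p").text = text
    return ET.tostring(nitf, encoding="unicode")


def parse_article(xml_text):
    """Extract headline, byline and paragraphs into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext(".//hl1"),
        "byline": root.findtext(".//byline"),
        "paragraphs": [p.text for p in root.iter("p")],
    }


if __name__ == "__main__":
    article = parse_article(build_sample_article())
    print(article["headline"], "/", article["byline"])
    for paragraph in article["paragraphs"]:
        print(paragraph)
```
</pre>

<p>The sample document is built programmatically so that the sketch is
self-contained; with real data, the text of a corpus file would be
passed to the hypothetical <code>parse_article</code> helper instead.</p>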

<p class="MsoNormal" style="">The

New York Times has established a community website for researchers

working on

the data set at <a href="http://groups.google.com/group/nytnlp">http://groups.google.com/group/nytnlp</a>

and encourages feedback and discussion about the corpus. <br>

</p>

<br>

<hr size="2" width="100%"><br>

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>

</body>

</html>