[Corpora-List] News from the LDC

Tue Aug 26 20:15:24 UTC 2008

-  Programmer Analyst Positions at LDC  -

LDC2008T13
-  BLLIP North American News Text, Complete 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T13>  -

LDC2008T14
-  BLLIP North American News Text, General Release 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T14>  -

LDC2008T15
-  North American News Text, Complete 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T15>  -

LDC2008T16
-  North American News Text, General Release 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T16>  -

The Linguistic Data Consortium (LDC) would like to announce position 
openings for programmer analysts and the availability of new publications.

------------------------------------------------------------------------

*Programmer Analyst Positions at LDC*

The LDC at the University of Pennsylvania has several immediate openings 
for full-time programmer analysts.

    *   Programmer Analyst - Text and Speech Annotation Support (#080725253)

      Duties: This position will support LDC's language resource 
creation projects by providing programming, technical and research 
support in a lead capacity.  Primary responsibilities will be to design, 
develop and implement programming solutions and oversee all technical 
aspects of the projects; work with LDC's project managers, annotators, 
programmers, and clients to develop achievable plans for corpus or 
software development and successfully execute them; write annotation 
tools, data processing tools, web applications and other software 
necessary for the projects; support annotation workflow; support 
end-users; and investigate technical issues that may arise during the 
life cycles of projects, providing timely solutions as necessary.

    *   Programmer Analyst - Arabic Treebank (#080324301)

      Duties: Same as above; this position will primarily work on Arabic 
Treebank and other Arabic-related projects. (Grammatical knowledge of 
Arabic and the ability to read it are highly preferred for this position.)

    *   Programmer Analyst - External Relations (#080725188)

      Duties: This position will support LDC's External Relations Group 
by designing, developing, coding and providing support for LDC's 
business systems. The business systems support the organization's 
membership and sales activities and time tracking; features include 
invoicing, member tracking and reporting functions.  This position will 
also coordinate and prepare publications of language resources -- such 
as video, computer-readable speech, software and text data -- used for 
human language technology research and technology development.

For further information on the duties and qualifications for these 
positions, or to apply online, please visit http://jobs.hr.upenn.edu/ 
and search postings for the reference numbers indicated above.

Penn offers an excellent benefits package including medical/dental, 
retirement plans, tuition assistance and a minimum of 3 weeks paid 
vacation per year. The University of Pennsylvania is an affirmative 
action/equal opportunity employer.  Positions contingent upon funding.

For more information about LDC and the programs we support, visit 
http://www.ldc.upenn.edu/.

*New Publications*

(1) - (2) Brown Laboratory for Linguistic Information Processing (BLLIP) 
North American News Text contains a Penn Treebank-style parsing of text 
from the North American News Text Corpus (LDC95T21) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21>. 
The North American News Text Corpus consists of English news text from 
the Los Angeles Times-Washington Post (1994-1997), the New York Times 
(1994-1996), Reuters News Service (1994-1996) and the Wall Street 
Journal (1994-1996).

The BLLIP North American News Text release is available in two versions: 
BLLIP North American News Text, Complete (LDC2008T13) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T13>, 
a Members-Only corpus that contains sentences from all sources in the 
North American News Text Corpus; and BLLIP North American News Text, 
General Release (LDC2008T14) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T14>, 
a corpus available to nonmembers that does not include the Wall Street 
Journal data from the North American News Text Corpus.

The data in this release was parsed into Penn Treebank-style parse trees 
using a re-ranking parser developed by Eugene Charniak and Mark Johnson. 
The Charniak and Johnson parser is statistically based and uses a 
generative first stage followed by a discriminative second stage. Both 
stages were trained on the Wall Street Journal data in Treebank-2 
(LDC95T7) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7> 
and Treebank-3 (LDC99T42) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>. 

In order to produce BLLIP North American News Text, the Charniak-Johnson 
parser used a simplified context-free grammar in the first stage to 
generate a set of n-best parses. Those parses were then pruned by 
eliminating the parses at the edges of the distribution. In the second 
stage, a maximum entropy-based parser using a complete grammar was 
applied. The output trees are ranked in order of probability.  The 
parses in BLLIP North American News Text include constituency and POS 
tagging information for each of the 50-best parses of each sentence.  
Each file contains a sequence of n-best lists. An n-best list is a list 
of the top n parses of a sentence with the corresponding parser 
probability and re-ranker score for each parse.
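
As a rough illustration of how an n-best list might be used once loaded, 
the Python sketch below holds each candidate parse together with its 
parser probability and re-ranker score and picks the preferred parse 
under each. The class name, field names, trees and score values are all 
hypothetical and invented for this example; the sketch does not describe 
the on-disk file format of the release, which should be taken from the 
corpus documentation.

```python
from dataclasses import dataclass


@dataclass
class Parse:
    """One candidate parse in an n-best list (hypothetical in-memory form)."""
    tree: str              # bracketed Penn Treebank-style tree
    parser_logprob: float  # first-stage (generative) parser log probability
    reranker_score: float  # second-stage (discriminative) re-ranker score


def best_by_parser(nbest):
    """Return the parse the first-stage parser considers most probable."""
    return max(nbest, key=lambda p: p.parser_logprob)


def best_by_reranker(nbest):
    """Return the parse the re-ranker scores highest."""
    return max(nbest, key=lambda p: p.reranker_score)


# Invented two-item n-best list for a single sentence; values are made up.
nbest = [
    Parse("(S1 (S (NP (NNS Stocks)) (VP (VBD rose))))", -20.5, 1.8),
    Parse("(S1 (S (NP (NNP Stocks)) (VP (VBD rose))))", -19.9, 0.4),
]

print(best_by_parser(nbest).tree)
print(best_by_reranker(nbest).tree)
```

In this invented list the two stages disagree, which is the situation 
re-ranking is meant to handle: the re-ranker can prefer a parse that the 
first stage did not rank highest.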


(3) - (4) North American News Text is a collection of English news text 
from the Los Angeles Times, Washington Post, New York Times, Reuters and 
the Wall Street Journal. This corpus was originally released in 1995 as 
the North American News Text Corpus (LDC95T21) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21> and 
is reissued to complement the release of the Brown Laboratory for 
Linguistic Information Processing (BLLIP) North American News Text sets 
(LDC2008T13, LDC2008T14), which consist of Penn Treebank-style parsing 
of that news text.

North American News Text is reissued in two versions: North American 
News Text, Complete (LDC2008T15) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T15>, 
the Members-Only original version, now available as a 2008 Membership 
Year corpus; and North American News Text, General Release (LDC2008T16) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T16>, 
a corpus available to nonmembers, which does not include text from the 
Wall Street Journal. The directory structure of each of these 
publications has been restructured to be identical to the directory 
structure of the BLLIP releases.  The text content of each data file 
(following uncompression with the GNU-unzip utility) consists of plain 
ASCII character data with SGML tags to indicate article boundaries and 
organization of information within each article.
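
A minimal Python sketch of reading such a file is given below: it 
decompresses gzip data and splits the text into articles on a pair of 
SGML boundary tags. The tag name DOC and the helper name split_articles 
are assumptions made for illustration only; the actual tag set should be 
checked against the corpus documentation. The sample data is built in 
memory so that the sketch runs without any corpus files.

```python
import gzip
import re

# Assumed article-boundary tags; "DOC" is a placeholder, not confirmed
# by the announcement.
DOC_RE = re.compile(r"<DOC\b[^>]*>(.*?)</DOC>", re.DOTALL)


def split_articles(compressed: bytes) -> list[str]:
    """Decompress gzip data and return the text between each tag pair."""
    text = gzip.decompress(compressed).decode("ascii", errors="replace")
    return [body.strip() for body in DOC_RE.findall(text)]


# Invented sample standing in for one compressed data file.
sample = gzip.compress(
    b"<DOC>\nFirst article.\n</DOC>\n<DOC>\nSecond article.\n</DOC>\n"
)

articles = split_articles(sample)
print(len(articles))
```

For a real file, the bytes could be read from disk and passed to the same 
function; a regular expression is used here only because the markup is 
assumed to be simple and non-nested.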

------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                    http://www.ldc.upenn.edu
