[Corpora-List] News from the LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Aug 26 20:15:24 UTC 2008
- Programmer Analyst Positions at LDC -

LDC2008T13 - BLLIP North American News Text, Complete
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T13>
LDC2008T14 - BLLIP North American News Text, General Release
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T14>
LDC2008T15 - North American News Text, Complete
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T15>
LDC2008T16 - North American News Text, General Release
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T16>

The Linguistic Data Consortium (LDC) would like to announce position
openings for programmer analysts and the availability of new publications.
------------------------------------------------------------------------
*Programmer Analyst Positions at LDC*
The LDC at the University of Pennsylvania has several immediate openings
for full-time programmer analysts.
* Programmer Analyst - Text and Speech Annotation Support (#080725253)
Duties: This position will support LDC's language resource
creation projects by providing programming, technical and research
support in a lead capacity. Primary responsibilities will be to design,
develop and implement programming solutions and oversee all technical
aspects of the projects; work with LDC's project managers, annotators,
programmers, and clients to develop achievable plans for corpus or
software development and successfully execute them; write annotation
tools, data processing tools, web applications and other software
necessary for the projects; support annotation workflow; support
end-users; investigate technical issues that may arise during the life
cycles of projects, and provide timely solutions to them as necessary.
* Programmer Analyst - Arabic Treebank (#080324301)
Duties: Same as above; this position will primarily work on Arabic
Treebank and other Arabic-related projects. (Grammatical knowledge and
reading ability of the Arabic language highly preferred for this position.)
* Programmer Analyst - External Relations (#080725188)
Duties: This position will support LDC's External Relations Group
by designing, developing, coding and providing support for LDC's
business systems. The business systems support the organization's
membership and sales activities and time tracking; features include
invoicing, member tracking and reporting functions. This position will
also coordinate and prepare publications of language resources -- such
as video, computer-readable speech, software and text data -- used for
human language technology research and technology development.
For further information on the duties and qualifications for these
positions, or to apply online, please visit http://jobs.hr.upenn.edu/
and search postings for the reference numbers indicated above.
Penn offers an excellent benefits package including medical/dental,
retirement plans, tuition assistance and a minimum of 3 weeks paid
vacation per year. The University of Pennsylvania is an affirmative
action/equal opportunity employer. Positions contingent upon funding.
For more information about LDC and the programs we support, visit
http://www.ldc.upenn.edu/.
*New Publications*
(1) - (2) Brown Laboratory for Linguistic Information Processing (BLLIP)
North American News Text contains a Penn Treebank-style parsing of text
from the North American News Text Corpus (LDC95T21)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21>.
The North American News Text Corpus consists of English news text from
the Los Angeles Times-Washington Post (1994-1997), the New York Times
(1994-1996), Reuters News Service (1994-1996) and the Wall Street
Journal (1994-1996).
The BLLIP North American News Text release is available in two versions:
BLLIP North American News Text, Complete (LDC2008T13)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T13>,
a Members-Only corpus that contains sentences from all sources in the
North American News Text Corpus; and BLLIP North American News Text,
General Release (LDC2008T14)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T14>,
a corpus available to nonmembers that does not include the Wall Street
Journal data from the North American News Text Corpus.
The data in this release was parsed into Penn Treebank-style parse trees
using a re-ranking parser developed by Eugene Charniak and Mark Johnson.
The Charniak and Johnson parser is statistically-based and uses a
generative first stage followed by a discriminative second stage. Both
stages were trained on the Wall Street Journal data in Treebank-2
(LDC95T7)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T7>
and Treebank-3 (LDC99T42)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42>.
In order to produce BLLIP North American News Text, the Charniak-Johnson
parser used a simplified context-free grammar in the first stage to
generate a set of n-best parses. Those parses were then pruned by
eliminating the parses at the edges of the distribution. In the second
stage, a maximum entropy-based parser using a complete grammar was
applied. The output trees are ranked in order of probability. The
parses in BLLIP North American News Text include constituency and POS
tagging information for each of the 50-best parses of each sentence.
Each file contains a sequence of n-best lists. An n-best list is a list
of the top n parses of each sentence with the corresponding parser
probability and re-ranker score.
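
As a rough illustration of the n-best lists described above, the following Python sketch models one list and picks a parse from it. The on-disk file layout is not specified in this announcement, so the sketch works on in-memory objects only; the class name, field names, toy trees and scores are all hypothetical, and treating the parser probability as a log probability is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Parse:
    tree: str              # bracketed Penn Treebank-style tree (toy example)
    parser_logprob: float  # first-stage parser probability (log scale assumed)
    reranker_score: float  # second-stage re-ranker score (higher is better here)


def best_by_reranker(nbest: List[Parse]) -> Parse:
    """Return the parse with the highest re-ranker score."""
    if not nbest:
        raise ValueError("empty n-best list")
    return max(nbest, key=lambda p: p.reranker_score)


def best_by_parser(nbest: List[Parse]) -> Parse:
    """Return the parse with the highest first-stage parser probability."""
    if not nbest:
        raise ValueError("empty n-best list")
    return max(nbest, key=lambda p: p.parser_logprob)


def pos_tags(tree: str) -> List[Tuple[str, str]]:
    """Extract (tag, word) pairs from the leaves of a bracketed tree."""
    return re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree)


# Hypothetical two-entry n-best list for one sentence.
nbest = [
    Parse("(S1 (S (NP (NNS Stocks)) (VP (VBD rose))))", -20.5, -3.1),
    Parse("(S1 (S (NP (NNP Stocks)) (VP (VBD rose))))", -19.8, -4.0),
]

chosen = best_by_reranker(nbest)
print(pos_tags(chosen.tree))
```

In this toy list the two stages disagree: the first-stage probability favors the second parse, while the re-ranker score favors the first, which is the situation a re-ranking stage exists to handle.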
***
(3) - (4) North American News Text is a collection of English news text
from the Los Angeles Times, Washington Post, New York Times, Reuters and
the Wall Street Journal. This corpus was originally released in 1995 as
the North American News Text Corpus (LDC95T21)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21> and
is reissued to complement the release of the Brown Laboratory for
Linguistic Information Processing (BLLIP) North American News Text sets
(LDC2008T13, LDC2008T14), which consist of Penn Treebank-style parsing
of that news text.
North American News Text is reissued in two versions: North American
News Text, Complete (LDC2008T15)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T15>,
the Members-Only original version, now available as a 2008 Membership
Year corpus; and North American News Text, General Release (LDC2008T16)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T16>,
a corpus available to nonmembers, which does not include text from the
Wall Street Journal. The directory structure of each of these
publications has been restructured to be identical to the directory
structure of the BLLIP releases. The text content of each data file
(following uncompression with the GNU-unzip utility) consists of plain
ASCII character data with SGML tags to indicate article boundaries and
organization of information within each article.
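
A minimal sketch of splitting such a file into articles might look like the following. The announcement does not name the SGML tags, so the `<DOC>` and `<TEXT>` elements, the `id` attribute and the sample text are assumptions made purely for illustration; the actual tag set should be taken from the corpus documentation. The sketch operates on text that has already been uncompressed.

```python
import re
from typing import List

# Assumed article delimiter: <DOC ...> ... </DOC>. The real tag names may differ.
DOC_RE = re.compile(r"<DOC\b[^>]*>(.*?)</DOC>", re.DOTALL)


def split_articles(sgml_text: str) -> List[str]:
    """Return the body of each article found between the assumed DOC tags."""
    return [body.strip() for body in DOC_RE.findall(sgml_text)]


# Hypothetical sample data, not taken from the corpus.
sample = (
    '<DOC id="a1">\n<TEXT>\nFirst article.\n</TEXT>\n</DOC>\n'
    '<DOC id="a2">\n<TEXT>\nSecond article.\n</TEXT>\n</DOC>\n'
)

articles = split_articles(sample)
print(len(articles))
```

A regular expression is used here rather than an XML parser because SGML of this kind is not guaranteed to be well-formed XML.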
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu