[Corpora-List] New Publications from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Thu Jun 25 16:14:29 UTC 2009
LDC2009T15
- GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15>
LDC2009T14
- Tagged Chinese Gigaword Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14>
The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.
------------------------------------------------------------------------
New Publications
(1) GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15>
contains 240,000 characters (112 files) of Chinese newsgroup text and
its translation selected from twenty-five sources. Newsgroups consist of
posts to electronic bulletin boards, Usenet newsgroups, discussion
groups and similar forums. This release was used as training data in
Phase 1 (year 1) of the DARPA-funded GALE program.
Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting and data selection.
Data scouting involved manually searching the web for suitable newsgroup
text. Data scouts were assigned particular topics and genres along with
a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest to a
database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so that
the data could be more easily integrated into downstream annotation
processes. Typically, a new script was required for each new domain name
that was identified. After scripts were run, an optional manual process
corrected any remaining formatting problems.
The selected documents were then reviewed for content-suitability using
a semi-automatic process. A statistical approach was used to rank a
document's relevance to a set of already-selected documents labeled as
"good." An annotator then reviewed the list of relevance-ranked
documents and selected those which were suitable for a particular
annotation task or for annotation in general. These newly-judged
documents in turn provided additional input for the generation of new
ranked lists.
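The ranking step described above can be sketched in a few lines. The announcement does not say which statistical approach was used, so the following is only an illustration under stated assumptions: character bigrams as features and cosine similarity against the pooled "good" documents. The function names (char_bigrams, cosine, rank_candidates) are hypothetical and are not part of any LDC toolkit.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return a bag of character bigrams (assumed feature set, not LDC's)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(good_docs, candidates):
    """Sort candidate texts by similarity to the pooled "good" documents."""
    profile = Counter()
    for doc in good_docs:
        profile.update(char_bigrams(doc))
    scored = [(cosine(char_bigrams(doc), profile), doc) for doc in candidates]
    # Highest-scoring documents come first for the annotator to review.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In a loop like the one described, documents the annotator accepts would be appended to good_docs before the next ranked list is generated.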
Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features and quality
control procedures applied to completed translations.
(2) Tagged Chinese Gigaword Version 2.0,
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14>
created by scholars at Academia Sinica
<http://www.sinica.edu.tw/main_e.shtml>, Taipei, Taiwan, is a
part-of-speech tagged version of LDC's Chinese Gigaword Second Edition
(LDC2005T14)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>.
Like the original release, Version 2.0 contains all of the data in
Chinese Gigaword Second Edition -- from Central News Agency, Xinhua News
Agency and Lianhe Zaobao -- annotated with full part-of-speech tags. In
addition, this new release removes residual noise in the original and
improves tagging accuracy by incorporating lexica of unknown words. The
changes represented in Version 2.0 include the following:
* A single-width space is used consistently between two segmented
words.
* The position of the newline character remains fixed, better
reflecting the source files from Chinese Gigaword Second Edition
(LDC2005T14)
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>.
* The original coding of partial Latin letters or Arabic numerals is
preserved.
* 1,192 documents from Central News Agency (Taiwan) and 13 documents
from Xinhua News Agency that were missing from the first
publication are included.
* A set of heuristics for building out-of-vocabulary dictionaries to
improve annotation quality of very large corpora is incorporated.
Documents in the corpus were assigned one of the following categories:
* *story*: This type of DOC represents a coherent report on a
particular topic or event, consisting of paragraphs and full
sentences.
* *multi*: This type of DOC contains a series of unrelated
"blurbs," each of which briefly describes a particular topic or
event; examples include "summaries of today's news," "news briefs
in ..." (some general area like finance or sports), and so on.
* *advis*: These are DOCs which the news service addresses to news
editors; they are not intended for publication to the "end users."
* *other*: These DOCs clearly do not fall into any of the above
types; they include items such as lists of sports scores, stock
prices, temperatures around the world, and so on.
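As a rough illustration of how the four categories might be tallied, the sketch below counts documents by type. It assumes each document opens with a Gigaword-style header of the form <DOC id="..." type="...">; the announcement itself does not show the markup, so the pattern and the function name count_doc_types are assumptions to be checked against the corpus documentation.

```python
import re
from collections import Counter

# Assumed header shape, e.g. <DOC id="..." type="story">; verify against
# the corpus documentation before relying on it.
DOC_HEADER = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def count_doc_types(text):
    """Count documents per category (story, multi, advis, other)."""
    return Counter(doc_type for _doc_id, doc_type in DOC_HEADER.findall(text))
```

Applied to the decompressed text of one file, count_doc_types returns a Counter keyed by category name.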
Since neither manual checking nor automatic checking against a gold
standard is feasible for gigaword-size corpora, the authors proposed
quality assurance of automatic annotation of very large corpora based on
the heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By
comparing against word lists generated from the ICTCLAS version of the
automatically tagged Xinhua portion of Chinese Gigaword, a set of
heuristics for building out-of-vocabulary dictionaries to improve quality
was proposed. Randomly selected texts were manually checked to evaluate
the effects of these out-of-vocabulary dictionaries. Experimental
results indicate that 30,562 of the tested words (about 97.3%) were
correct.
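The general idea of comparing word lists from two heterogeneous taggers can be sketched as follows. This is not the authors' heuristic set, which the announcement does not spell out; it is a simple stand-in that assumes a word is an out-of-vocabulary candidate when a second tagger produces it at least min_count times and it is absent from the first system's lexicon. The function name oov_candidates and the min_count threshold are hypothetical.

```python
from collections import Counter


def oov_candidates(reference_tokens, known_lexicon, min_count=2):
    """Return frequent words from a second tagger's output that are
    missing from the first system's lexicon (assumed heuristic)."""
    counts = Counter(reference_tokens)
    return sorted(
        word
        for word, n in counts.items()
        if n >= min_count and word not in known_lexicon
    )
```

Candidates produced this way would still need the kind of manual spot check described above before being added to a dictionary.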
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA http://www.ldc.upenn.edu