[Corpora-List] New Publications from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Thu Jun 25 16:14:29 UTC 2009


LDC2009T15
GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15>

LDC2009T14
Tagged Chinese Gigaword Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14>

The Linguistic Data Consortium (LDC) would like to announce the 
availability of two new publications.

------------------------------------------------------------------------
*New Publications*

(1)  GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15> 
contains 240,000 characters (112 files) of Chinese newsgroup text and 
its translation selected from twenty-five sources. Newsgroups consist of 
posts to electronic bulletin boards, Usenet newsgroups, discussion 
groups and similar forums. This release was used as training data in 
Phase 1 (year 1) of the DARPA-funded GALE program.

Preparing the source data involved four stages of work: data scouting, 
data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable newsgroup 
text. Data scouts were assigned particular topics and genres along with 
a production target in order to focus their web search. Formal 
annotation guidelines and a customized annotation toolkit helped data 
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest to a 
database. A nightly process queried the annotation database and 
harvested all designated URLs. Whenever possible, the entire site was 
downloaded, not just the individual thread or post located by the data 
scout. Once the text was downloaded, its format was standardized so that 
the data could be more easily integrated into downstream annotation 
processes. Typically, a new script was required for each new domain name 
that was identified. After scripts were run, an optional manual process 
corrected any remaining formatting problems.

The selected documents were then reviewed for content-suitability using 
a semi-automatic process. A statistical approach was used to rank a 
document's relevance to a set of already-selected documents labeled as 
"good." An annotator then reviewed the list of relevance-ranked 
documents and selected those which were suitable for a particular 
annotation task or for annotation in general. These newly-judged 
documents in turn provided additional input for the generation of new 
ranked lists.
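
The relevance-ranking step can be pictured with a minimal Python sketch. 
The announcement does not say which statistical approach was used, so 
the character-bigram cosine similarity below is an assumption chosen for 
illustration, not LDC's actual method. The function names and the toy 
posts are likewise hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Bag of character bigrams; sidesteps the need for a Chinese word segmenter."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(good_docs, candidates):
    """Return (score, name) pairs, most similar to the "good" set first."""
    profile = Counter()
    for doc in good_docs:
        profile.update(char_bigrams(doc))
    scored = [(cosine(char_bigrams(text), profile), name)
              for name, text in candidates.items()]
    return sorted(scored, reverse=True)


# Toy data, invented for illustration only.
good = ["今天股市大涨", "股市今天下跌"]
pool = {"post_a": "明天股市可能大涨", "post_b": "我喜欢吃苹果"}
for score, name in rank_candidates(good, pool):
    print(f"{name}\t{score:.3f}")
```

In a loop like the one described above, documents the annotator accepts 
would be appended to `good` before the next ranked list is generated.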

Manual sentence unit/segment (SU) annotation was also performed as 
part of the transcription task. Three types of end-of-sentence SU were 
identified: statement SU, question SU, and incomplete SU. After 
transcription and SU annotation, files were reformatted into a 
human-readable translation format and assigned to professional 
translators for careful translation. Translators followed LDC's GALE 
Translation guidelines, which describe the makeup of the translation 
team, the source data format, the translation data format, best 
practices for translating certain linguistic features, and quality 
control procedures applied to completed translations.


(2)  Tagged Chinese Gigaword Version 2.0, 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14> 
created by scholars at Academia Sinica 
<http://www.sinica.edu.tw/main_e.shtml>, Taipei, Taiwan, is a 
part-of-speech tagged version of LDC's Chinese Gigaword Second Edition 
(LDC2005T14) 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>. 
Like the original release, Version 2.0 contains all of the data in 
Chinese Gigaword Second Edition -- from Central News Agency, Xinhua News 
Agency and Lianhe Zaobao -- annotated with full part-of-speech tags. In 
addition, this new release removes residual noise in the original and 
improves tagging accuracy by incorporating lexica of unknown words. The 
changes represented in Version 2.0 include the following:

    * A single-width space is used consistently between two segmented
      words.
    * The position of the newline character remains fixed, better
      reflecting the source files from Chinese Gigaword Second Edition
      (LDC2005T14)
      <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>.
    * The original coding of partial Latin letters or Arabic numerals is
      preserved.
    * 1,192 documents from Central News Agency (Taiwan) and 13 documents
      from Xinhua News Agency that were missing from the first
      publication are included.
    * A set of heuristics for building out-of-vocabulary dictionaries to
      improve annotation quality of very large corpora is incorporated.

Documents in the corpus were assigned one of the following categories:

    * *story*:   This type of DOC represents a coherent report on a
      particular topic or event, consisting of paragraphs and full
      sentences.
    * *multi*:   This type of DOC contains a series of unrelated
      "blurbs," each of which briefly describes a particular topic or
      event; examples include "summaries of today's news," "news briefs
      in ..." (some general area like finance or sports), and so on.
    * *advis*:   These are DOCs which the news service addresses to news
      editors; they are not intended for publication to the "end users."
    * *other*:   These DOCs clearly do not fall into any of the above
      types; they include items such as lists of sports scores, stock
      prices, temperatures around the world, and so on.
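
As a rough illustration of how these categories might be tallied, the 
Python sketch below counts DOC types. It assumes the Gigaword-style 
opening tag `<DOC id="..." type="...">`; check the corpus documentation 
before relying on that layout. The sample IDs and the function name are 
hypothetical.

```python
import re
from collections import Counter

# Assumed markup: <DOC id="..." type="story">; verify against the corpus docs.
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="(\w+)"')
KNOWN_TYPES = {"story", "multi", "advis", "other"}


def count_doc_types(lines):
    """Count documents per category, skipping lines without a DOC opening tag."""
    counts = Counter()
    for line in lines:
        match = DOC_TAG.search(line)
        if match and match.group(1) in KNOWN_TYPES:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not taken from the corpus.
sample = [
    '<DOC id="XIN_CMN_20040101.0001" type="story">',
    '<TEXT>',
    '<DOC id="CNA_CMN_20040101.0002" type="multi">',
    '<DOC id="XIN_CMN_20040101.0003" type="story">',
]
print(dict(count_doc_types(sample)))
```

Filtering on the `story` type is a common first step when only running 
text is wanted.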

Since neither manual checking nor automatic checking against a gold 
standard is feasible for gigaword-size corpora, the authors proposed a 
quality assurance method for automatic annotation of very large corpora 
based on the heterogeneous CKIP and ICTCLAS tagging systems (Huang et 
al., 2008). By comparing against word lists generated from the 
ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, they 
proposed a set of heuristics for building out-of-vocabulary dictionaries 
to improve quality. Randomly selected texts were manually checked to 
evaluate the effect of these out-of-vocabulary dictionaries. 
Experimental results indicate that 30,562 (about 97.3%) of the tested 
words were correct.
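
The idea of comparing word lists from two heterogeneous systems can be 
sketched in Python as follows. This is a simplified illustration and not 
the heuristics of Huang et al. (2008). It assumes each system's output 
has already been reduced to space-separated words with the tags 
stripped. The function names, the `min_count` threshold, and the toy 
sentences are hypothetical.

```python
from collections import Counter


def word_counts(segmented_lines):
    """Count words in lines where segmented words are separated by single spaces."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts


def oov_candidates(system_a_lines, system_b_lines, min_count=2):
    """Words that system B produces at least min_count times but system A never does."""
    a = word_counts(system_a_lines)
    b = word_counts(system_b_lines)
    found = [(word, n) for word, n in b.items() if n >= min_count and word not in a]
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))


# Toy segmentations, invented for illustration only.
system_a = ["奥 巴 马 访问 中国", "奥 巴 马 发表 讲话"]
system_b = ["奥巴马 访问 中国", "奥巴马 发表 讲话"]
print(oov_candidates(system_a, system_b))
```

Candidates found this way would still need manual review before being 
added to an out-of-vocabulary dictionary.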

------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator

--------------------------------------------------------------------

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
 Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu
