<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#000000">
<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"
align="center">LDC2009T15<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15">GALE
Phase 1 Chinese Newsgroup Parallel Text - Part 1</a> -</b><br>
<br>
LDC2009T14<br>
<b>- <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14">Tagged
Chinese Gigaword Version 2.0</a> -</b><br>
<br>
The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.</p>
<div class="MsoNormal" style="text-align: center;" align="center">
<hr align="center" size="2" width="100%"></div>
<div align="center"><b>New Publications</b>
</div>
<p class="MsoNormal"><br>
(1) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15">GALE
Phase 1 Chinese Newsgroup Parallel Text - Part 1</a> contains 240,000
characters (112 files) of Chinese newsgroup text and its translation,
selected from twenty-five sources. Newsgroups consist of posts to electronic
bulletin boards, Usenet newsgroups, discussion groups and similar forums. This
release was used as training data in Phase 1 (year 1) of the DARPA-funded
GALE program.</p>
<p>Preparing the source data involved four stages of work: data
scouting, data harvesting, formatting and data selection.</p>
<p class="MsoNormal" style="">Data
scouting involved manually searching the web for suitable newsgroup
text. Data
scouts were assigned particular topics and genres along with a
production
target in order to focus their web search. Formal annotation guidelines
and a
customized annotation toolkit helped data scouts to manage the search
process
and to track progress.</p>
<p>Data scouts logged their decisions about potential text of interest
to a
database. A nightly process queried the annotation database and
harvested all
designated URLs. Whenever possible, the entire site was downloaded, not
just
the individual thread or post located by the data scout. Once the text
was
downloaded, its format was standardized so that the data could be more
easily
integrated into downstream annotation processes. Typically, a new
script was
required for each new domain name that was identified. After scripts
were run,
an optional manual process corrected any remaining formatting problems.<br>
<br>
The selected documents were then reviewed for content suitability using
a semi-automatic process. A statistical approach was used to rank each
document's relevance to a set of already-selected documents labeled as "good."
An annotator then reviewed the list of relevance-ranked documents and
selected those that were suitable for a particular annotation task or for
annotation in general. These newly judged documents in turn provided
additional input for the generation of new ranked lists.</p>
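<p>The ranking step above can be sketched as follows. The announcement
does not say which statistical method was used, so this example assumes
cosine similarity over character bigrams; the function names
(char_bigrams, cosine, rank_by_relevance) are hypothetical and are not
part of any LDC tool.</p>
<pre>
```python
import math
from collections import Counter


def char_bigrams(text):
    """Return a bag of character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_relevance(good_docs, candidates):
    """Rank candidates by similarity to the pooled 'good' documents (assumed method)."""
    profile = Counter()
    for doc in good_docs:
        profile.update(char_bigrams(doc))
    scored = [(cosine(char_bigrams(doc), profile), doc) for doc in candidates]
    # Highest-scoring documents come first for the annotator to review.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```
</pre>
<p>In this sketch, an annotator would review the top of the ranked list,
add the accepted documents to the "good" set and rerun the ranking,
which mirrors the feedback loop described above.</p>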
<p class="MsoNormal" style="margin-bottom: 12pt;">Manual sentence
unit/segment (SU) annotation was also performed as part of the
transcription task. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. After transcription and SU
annotation, files were reformatted into a human-readable translation
format and assigned to professional translators for careful translation.
Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the
translation data format, best practices for translating certain
linguistic features and quality control procedures applied to completed
translations.<br>
<br>
<br>
</p>
<div align="center">*<br>
</div>
<p>(2) <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14">Tagged
Chinese Gigaword Version 2.0</a>, created by scholars at <a
href="http://www.sinica.edu.tw/main_e.shtml">Academia Sinica</a>, Taipei,
Taiwan, is a part-of-speech tagged version of LDC's <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese
Gigaword Second Edition (LDC2005T14)</a>. Like the original release,
Version 2.0 contains all of the data in Chinese Gigaword Second Edition -- from
Central News Agency, Xinhua News Agency and Lianhe Zaobao -- annotated with
full part-of-speech tags. In addition, this new release removes residual noise
in the original and improves tagging accuracy by incorporating lexica of
unknown words. The changes represented in Version 2.0 include the following:</p>
<ul type="disc">
<li class="MsoNormal" style="">A single-width space is used
consistently between two segmented words.</li>
<li class="MsoNormal" style="">The position of the newline character
remains fixed, better reflecting the source files from <a
href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese
Gigaword Second Edition (LDC2005T14)</a>.</li>
<li class="MsoNormal" style="">The original coding of partial Latin
letters or Arabic numerals is preserved.</li>
<li class="MsoNormal" style="">1,192 documents from Central News
Agency (Taiwan)
and 13 documents from Xinhua News Agency that were missing from the
first publication are included.</li>
<li class="MsoNormal" style="">A set of heuristics for building
out-of-vocabulary dictionaries to improve annotation quality of very
large corpora is incorporated.</li>
</ul>
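<p>The first two changes in the list above (one single-width space
between segmented words, newline positions left where they are) can be
illustrated with a minimal sketch. It assumes that stray spacing
consists of ASCII spaces, tabs or full-width (U+3000) spaces; the
function name normalize_spacing is hypothetical and this is not the
procedure used to build the corpus.</p>
<pre>
```python
import re

# Runs of ASCII spaces, tabs, or full-width spaces; newlines are excluded on purpose.
_SPACE_RUN = re.compile("[ \t\u3000]+")


def normalize_spacing(text):
    """Collapse each run of spaces within a line to one ASCII space.

    The text is split on newlines and rejoined unchanged, so every
    line break stays exactly where it was in the input.
    """
    lines = text.split("\n")
    return "\n".join(_SPACE_RUN.sub(" ", line) for line in lines)
```
</pre>
<p>Because substitution happens line by line, the number and position of
newline characters in the output always match the input.</p>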
<p>Documents in the corpus were assigned one of the following
categories:</p>
<ul type="disc">
<li class="MsoNormal" style=""><strong>story</strong>: This type of
DOC represents a coherent report on a particular topic or event,
consisting of paragraphs and full sentences.</li>
<li class="MsoNormal" style=""><strong>multi</strong>: This type of
DOC contains a series of unrelated "blurbs," each of which briefly
describes a particular topic or event; examples include "summaries of
today's news," "news briefs in ..." (some general area like finance or
sports), and so on.</li>
<li class="MsoNormal" style=""><strong>advis</strong>: These are
DOCs which the news service addresses to news editors; they are not
intended for publication to the "end users."</li>
<li class="MsoNormal" style=""><strong>other</strong>: These DOCs
clearly do not fall into any of the above types; they include items
such as lists of sports scores, stock prices, temperatures around the
world, and so on.</li>
</ul>
<p class="MsoNormal">Since neither manual checking
nor automatic checking against a gold standard is feasible for
gigaword-size corpora, the authors proposed a quality assurance method
for automatic annotation of very large corpora based on the
heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By
comparing against word lists generated from the ICTCLAS-tagged version
of the Xinhua portion of Chinese Gigaword, they proposed a set of
heuristics for building out-of-vocabulary dictionaries to improve
annotation quality. Randomly selected texts were manually checked to
evaluate the effect of these out-of-vocabulary dictionaries.
Experimental results indicate that 30,562 of the tested words (about
97.3%) were correct.<br>
<br>
</p>
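<p>The idea of comparing word lists from two different tagging systems
can be illustrated with a small sketch. The actual heuristics are those
of Huang et al. (2008) and are not reproduced in this announcement; the
frequency threshold, the function names (word_counts, oov_candidates)
and the space-separated input format are assumptions made for
illustration only.</p>
<pre>
```python
from collections import Counter


def word_counts(segmented_lines):
    """Count words in lines whose words are separated by whitespace."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts


def oov_candidates(second_system_lines, lexicon, min_count=2):
    """List words from a second system's output that the first system's lexicon lacks.

    Only words seen at least min_count times are kept (an assumed
    filter, meant to drop one-off segmentation errors).
    """
    counts = word_counts(second_system_lines)
    return sorted(
        word for word, n in counts.items()
        if n >= min_count and word not in lexicon
    )
```
</pre>
<p>Candidates produced this way would still need the kind of manual
checking on randomly selected texts that the paragraph above
describes.</p>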
<hr size="2" width="100%">
<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya
Ahtaridis<br>
Membership Coordinator</big><br>
<br>
</small>--------------------------------------------------------------------</small><small><br>
</small></font></div>
<div align="center">
<pre class="moz-signature" cols="72"><font
face="Courier New, Courier, monospace">Linguistic Data Consortium Phone: (215) 573-1275
University of Pennsylvania Fax: (215) 573-2175
3600 Market St., Suite 810 <a
class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Philadelphia, PA 19104 USA <a
class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>
</div>
</body>
</html>