<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

</head>

<body bgcolor="#ffffff" text="#000000">

<p class="MsoNormal" style="margin-bottom: 12pt; text-align: center;"

 align="center">LDC2009T15<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15">GALE

Phase 1 Chinese Newsgroup Parallel Text - Part 1</a>  -</b><br>

<br>

LDC2009T14<br>

<b>-  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14">Tagged

Chinese Gigaword Version 2.0</a>  -</b><br>

<br>

The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications.</p>

<div class="MsoNormal" style="text-align: center;" align="center">

<hr align="center" size="2" width="100%"></div>

<div align="center"><b>New Publications</b>

</div>

<p class="MsoNormal"><b><br>

</b>(1)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T15">GALE

Phase 1 Chinese Newsgroup Parallel Text - Part 1</a> contains 240,000
characters (112 files) of Chinese newsgroup text selected from
twenty-five sources, along with its translation. Newsgroup text consists
of posts to electronic bulletin boards, Usenet newsgroups, discussion
groups and similar forums. This release was used as training data in
Phase 1 (year 1) of the DARPA-funded GALE program.</p>

<p>Preparing the source data involved four stages of work: data
scouting, data harvesting, formatting and data selection.</p>

<p class="MsoNormal" style="">Data

scouting involved manually searching the web for suitable newsgroup

text. Data

scouts were assigned particular topics and genres along with a

production

target in order to focus their web search. Formal annotation guidelines

and a

customized annotation toolkit helped data scouts to manage the search

process

and to track progress. <o:p></o:p></p>

<p>Data scouts logged their decisions about potential text of interest

to a

database. A nightly process queried the annotation database and

harvested all

designated URLs. Whenever possible, the entire site was downloaded, not

just

the individual thread or post located by the data scout. Once the text

was

downloaded, its format was standardized so that the data could be more

easily

integrated into downstream annotation processes. Typically, a new

script was

required for each new domain name that was identified. After scripts

were run,

an optional manual process corrected any remaining formatting problems.<br>

<br>

The selected documents were then reviewed for content-suitability using

a

semi-automatic process. A statistical approach was used to rank a

document's

relevance to a set of already-selected documents labeled as "good."

An annotator then reviewed the list of relevance-ranked documents and

selected

those which were suitable for a particular annotation task or for

annotation in

general. These newly-judged documents in turn provided additional input

for the

generation of new ranked lists. <o:p></o:p></p>
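<p>The announcement does not say which statistical measure was used for
relevance ranking. As a rough illustration only, the sketch below assumes
a bag-of-words cosine similarity between each candidate and the summed
term counts of the documents already labeled "good"; the function names
and sample data are hypothetical.</p>

```python
import math
from collections import Counter


def tokenize(text):
    """Split on whitespace; assumes the text is already word-segmented."""
    return text.split()


def centroid(docs):
    """Sum the term counts of the documents already labeled "good"."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(good_docs, candidates):
    """Return (score, doc_id) pairs, most similar to the good set first."""
    profile = centroid(good_docs)
    scored = [(cosine(Counter(tokenize(text)), profile), doc_id)
              for doc_id, text in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical data: two accepted documents and two unreviewed candidates.
good = ["economy market trade growth", "market trade policy"]
candidates = {
    "post-1": "trade policy and market growth",
    "post-2": "football match score tonight",
}
ranking = rank_candidates(good, candidates)
```

<p>In a loop like the one described above, each document the annotator
accepts would be appended to <code>good</code> before the next ranked
list is generated.</p>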

<p class="MsoNormal" style="margin-bottom: 12pt;">Manual sentence
unit/segment (SU) annotation was also performed as part of the
transcription task. Three types of end-of-sentence SU were identified:
statement SU, question SU, and incomplete SU. After transcription and SU
annotation, files were reformatted into a human-readable translation
format and assigned to professional translators for careful translation.
Translators followed LDC's GALE Translation guidelines, which describe
the makeup of the translation team, the source data format, the
translation data format, best practices for translating certain
linguistic features, and quality control procedures applied to completed
translations.<br>

<br>

<br>

</p>

<div align="center">*<br>

</div>


<p>(2)  <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T14">Tagged

Chinese Gigaword Version 2.0,</a> created by scholars at <a

 href="http://www.sinica.edu.tw/main_e.shtml">Academia Sinica</a>, Taipei,
Taiwan, is a

part-of-speech tagged version of LDC's <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese

Gigaword Second Edition (LDC2005T14)</a>. Like the original release,
Version 2.0 contains all of the data in Chinese Gigaword Second Edition
(from Central News Agency, Xinhua News Agency and Lianhe Zaobao),
annotated with full part-of-speech tags. In addition, this new release
removes residual noise in the original and improves tagging accuracy by
incorporating lexica of unknown words. Version 2.0 includes the
following changes:</p>

<ul type="disc">

  <li class="MsoNormal" style="">A single-width space is used

consistently between two segmented words. <o:p></o:p></li>

  <li class="MsoNormal" style="">The position of the newline character

remains fixed, better reflecting the source files from <a

 href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14">Chinese

Gigaword Second Edition (LDC2005T14)</a>. <o:p></o:p></li>

  <li class="MsoNormal" style="">The original coding of partial Latin

letters or Arabic numerals is preserved. <o:p></o:p></li>

  <li class="MsoNormal" style="">1,192 documents from Central News

Agency (Taiwan)

and 13 documents from Xinhua News Agency that were missing from the

first publication are included. <o:p></o:p></li>

  <li class="MsoNormal" style="">A set of heuristics for building

out-of-vocabulary dictionaries to improve annotation quality of very

large corpora is incorporated. <o:p></o:p></li>

</ul>
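<p>The first two changes in the list above (consistent single spaces,
fixed newline positions) can be pictured with a small sketch. It assumes
that "single-width" means one ASCII space and that stray separators are
tabs or full-width (U+3000) spaces; the function name and the token/tag
strings are placeholders, not the corpus's actual markup or the authors'
tooling.</p>

```python
import re

# Runs of ASCII spaces, tabs, or full-width (U+3000) spaces; newlines excluded.
_SPACE_RUN = re.compile(r"[ \t\u3000]+")


def normalize_spacing(text):
    """Collapse each run of horizontal whitespace to one ASCII space and trim
    spaces at line edges, leaving every newline where it was."""
    lines = text.split("\n")
    return "\n".join(_SPACE_RUN.sub(" ", line).strip(" ") for line in lines)


# Placeholder tokens separated by a mix of full-width and doubled spaces.
raw = "AA(x)\u3000\u3000BB(y)  CC(z)\nDD(x)\u3000EE(y)"
clean = normalize_spacing(raw)
```

<p>Processing line by line is what keeps the newline positions aligned
with the source text in this sketch.</p>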

<p>Documents in the corpus were assigned one of the following

categories:<o:p></o:p></p>

<ul type="disc">

  <li class="MsoNormal" style=""><strong>story</strong>:   This type of

DOC represents a coherent report on a particular topic or event,

consisting of paragraphs and full sentences. <o:p></o:p></li>

  <li class="MsoNormal" style=""><strong>multi</strong>:   This type of

DOC contains a series of unrelated "blurbs," each of which briefly

describes a particular topic or event; examples include "summaries of

today's news," "news briefs in ..." (some general area like finance or

sports), and so on. <o:p></o:p></li>

  <li class="MsoNormal" style=""><strong>advis</strong>:   These are

DOCs which the news service addresses to news editors; they are not

intended for publication to the "end users." <o:p></o:p></li>

  <li class="MsoNormal" style=""><strong>other</strong>:   These DOCs

clearly do not fall into any of the above types; they include items

such as lists of sports scores, stock prices, temperatures around the

world, and so on.<o:p></o:p></li>

</ul>
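<p>For readers who process the corpus programmatically, the four
categories above map naturally onto an enumeration. The sketch below is
illustrative only: the class and function names are hypothetical, and how
the category label is stored in the files should be checked against the
corpus documentation.</p>

```python
from collections import Counter
from enum import Enum


class DocType(Enum):
    STORY = "story"   # coherent report on one topic or event
    MULTI = "multi"   # series of unrelated blurbs
    ADVIS = "advis"   # advisory addressed to news editors
    OTHER = "other"   # scores, prices, temperatures, and similar lists


def tally(labels):
    """Count documents per category; raises ValueError on an unknown label."""
    return Counter(DocType(label) for label in labels)


def keep_stories(docs):
    """Yield the ids of documents labeled 'story' from (doc_id, label) pairs."""
    for doc_id, label in docs:
        if DocType(label) is DocType.STORY:
            yield doc_id


# Hypothetical labels for three documents.
counts = tally(["story", "multi", "story"])
stories = list(keep_stories([("doc-a", "story"), ("doc-b", "advis")]))
```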

<p class="MsoNormal">Since neither manual checking nor automatic
checking against a gold standard is feasible for gigaword-size corpora,
the authors proposed a quality assurance method for automatic annotation
of very large corpora based on two heterogeneous tagging systems, CKIP
and ICTCLAS (Huang et al., 2008). By comparing against word lists
generated from an ICTCLAS-tagged version of the Xinhua portion of Chinese
Gigaword, they proposed a set of heuristics for building
out-of-vocabulary dictionaries to improve annotation quality. Randomly
selected texts were manually checked to evaluate the effect of these
out-of-vocabulary dictionaries. Experimental results indicate that 30,562
of the tested words (about 97.3%) were correct.<br>

<br>

<span style="color: black;"></span><o:p></o:p></p>
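<p>The actual heuristics are described in Huang et al. (2008) and are not
reproduced here. As a loose sketch of the general idea of comparing word
lists from two taggers, the code below flags words that a reference
tagger produces repeatedly but that are missing from the primary tagger's
lexicon. The function name, the frequency threshold and the sample words
are all assumptions made for illustration.</p>

```python
from collections import Counter


def oov_candidates(reference_words, primary_lexicon, min_count=2):
    """Return words that the reference tagger produced at least min_count
    times but that are absent from the primary tagger's lexicon."""
    counts = Counter(reference_words)
    return sorted(
        word for word, count in counts.items()
        if count >= min_count and word not in primary_lexicon
    )


# Hypothetical segmented output from the reference tagger and a tiny lexicon.
reference_output = ["博客", "经济", "博客", "网民", "经济", "博客"]
lexicon = {"经济", "发展"}
candidates = oov_candidates(reference_output, lexicon)
```

<p>Candidates found this way would still need review before being added
to an out-of-vocabulary dictionary, which is consistent with the manual
check on randomly selected texts mentioned above.</p>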

<hr size="2" width="100%">

<div align="center"><font face="Courier New, Courier, monospace"><small><small><big>Ilya

Ahtaridis<br>

Membership Coordinator</big><br>

<br>

</small>--------------------------------------------------------------------</small><small><br>

</small></font></div>

<div align="center">

<pre class="moz-signature" cols="72"><font

 face="Courier New, Courier, monospace">Linguistic Data Consortium                     Phone: (215) 573-1275

University of Pennsylvania                       Fax: (215) 573-2175

3600 Market St., Suite 810                         <a

 class="moz-txt-link-abbreviated" href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>

 Philadelphia, PA 19104 USA                   <a

 class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a></font></pre>

</div>


</body>

</html>