<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

The Linguistic Data Consortium has created detailed guidelines for manual

word alignment in Chinese-English and Arabic-English, for the DARPA

GALE program. Corpora developed under these guidelines will be

published in LDC's catalog in the coming months.

<br>

<br>

<a class="moz-txt-link-freetext"

 href="http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v4.0.pdf">http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v4.0.pdf</a>

<br>

<br>

<a class="moz-txt-link-freetext"

 href="http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Chinese_alignment_guidelines_v4.0.pdf">http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Chinese_alignment_guidelines_v4.0.pdf</a>

<br>

<br>

<a class="moz-txt-link-freetext"

 href="http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Chinese_WA_Tagging_Guidelines_V1.0.pdf">http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Chinese_WA_Tagging_Guidelines_V1.0.pdf</a>

<br>

<br>

We will also present two papers on our word alignment efforts at LREC

2010, which should be available in the proceedings and on LDC's

website.

<br>

<br>

Enriching Word Alignment with Linguistic Tags - Xuansong Li, Niyu Ge,

Stephen Grimes, Stephanie Strassel and Kazuaki Maeda

<br>

<br>

Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC -

Stephen Grimes, Xuansong Li, Ann Bies, Seth Kulick, Xiaoyi Ma and

Stephanie Strassel

<br>

<br>

<br>

Nitin Madnani wrote:

<blockquote

 cite="mid:i2ie74869c11004160735y87817c80ud241d5464bac8a22@mail.gmail.com"

 type="cite">

  <pre wrap="">There has been work on creating gold-standard alignments. See the following:

(1) The annotation style guide for the Blinker project by Dan Melamed.

Even though this was written for the purpose of creating

English-French alignments using the Blinker tool, some of the

guidelines still carry over to the general case.

<a class="moz-txt-link-freetext" href="http://repository.upenn.edu/cgi/viewcontent.cgi?article=1054&amp;context=ircs_reports">http://repository.upenn.edu/cgi/viewcontent.cgi?article=1054&amp;context=ircs_reports</a>

(2) Annotation guidelines for creating paraphrase alignments by

Callison-Burch, Cohn and Lapata. Even though this guide is to help

create alignments between sentences in the same language (English), it

might still be useful.

<a class="moz-txt-link-freetext" href="http://www.dcs.shef.ac.uk/~tcohn/paraphrase_guidelines.pdf">http://www.dcs.shef.ac.uk/~tcohn/paraphrase_guidelines.pdf</a>

(3) A more comprehensive collection of word alignment guidelines can

be found on Rada Mihalcea's web page:

<a class="moz-txt-link-freetext" href="http://www.cse.unt.edu/~rada/wa/#guidelinesWA">http://www.cse.unt.edu/~rada/wa/#guidelinesWA</a>

Cheers,

Nitin

On Fri, Apr 16, 2010 at 1:20 AM, mohnish jadwani <a class="moz-txt-link-rfc2396E" href="mailto:mohnishgj@gmail.com">&lt;mohnishgj@gmail.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Respected Readers,

Creating a Gold Standard Alignment is of vital importance when one

has to evaluate the results that word alignment tools like GIZA++ produce

on a bilingual corpus. This Gold Standard Alignment (test data), as many

of us know, serves as a reference against which one can evaluate the

results obtained using the training data. When creating this test data,

which is a subset of the training data, manually, one comes across a lot

of variation between the source and target languages while aligning

words, e.g.:

1# 5 # does(1) he(2) go(3) home(4) ?(5) # 4 2 4 3 0

1# 5 # क्या(1) वह(2) घर(3) जाता(4) है(5) # 0 2 4 3 0

the word "does" maps to 'ता' of 'जाता'

There are many such considerations one has to keep in mind while

creating a Gold Standard Alignment.

Could you please suggest any basic guidelines (even if not

English-Hindi specific) that one could follow while going about

this? Any reference paper or advice would be of great help.

Thanking You

Mohnish

_______________________________________________

Corpora mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

    </pre>

  </blockquote>


</blockquote>
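<br>

As a rough illustration of the notation in the quoted example, here is a minimal Python sketch. It assumes each line has the form "id# n # tok(1) ... tok(n) # a1 ... an", where a_i is the position of the word aligned to token i in the other sentence and 0 means unaligned. This layout is inferred from the example, not taken from any tool's documentation, and the names parse_alignment_line and link_pairs are hypothetical.

<pre>
```python
import re

# Matches a token such as "does(1)": the word, then its 1-based position.
TOKEN = re.compile(r"^(.*)\((\d+)\)$")


def parse_alignment_line(line):
    """Parse one 'id# n # tok(1) ... # a1 ...' line (assumed layout)."""
    sent_id, count, token_field, link_field = [f.strip() for f in line.split("#")]
    tokens = [TOKEN.match(t).group(1) for t in token_field.split()]
    links = [int(a) for a in link_field.split()]
    if len(tokens) != int(count) or len(links) != int(count):
        raise ValueError("token or link count does not match the declared length")
    return {"id": sent_id, "tokens": tokens, "links": links}


def link_pairs(parsed, flip=False):
    """Return the set of (position, aligned position) pairs, skipping 0 (unaligned)."""
    pairs = {(i, j) for i, j in enumerate(parsed["links"], start=1) if j != 0}
    return {(j, i) for i, j in pairs} if flip else pairs


en = parse_alignment_line("1# 5 # does(1) he(2) go(3) home(4) ?(5) # 4 2 4 3 0")
hi = parse_alignment_line("1# 5 # क्या(1) वह(2) घर(3) जाता(4) है(5) # 0 2 4 3 0")

# Express both directions as (English position, Hindi position) pairs.
en_to_hi = link_pairs(en)
hi_to_en = link_pairs(hi, flip=True)

# Links recorded from the English side only.
print(sorted(en_to_hi - hi_to_en))  # [(1, 4)]
```
</pre>

Comparing the two directions exposes the link for "does", which the one-position-per-token notation cannot record on the Hindi side because जाता already points to "go".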

<br>

<pre class="moz-signature" cols="72"><span class="moz-txt-tag">-- 

</span>Stephanie Strassel

Senior Associate Director

Linguistic Data Consortium

3600 Market Street, Suite 810  Philadelphia, PA 19104-2653 USA

office: 215-898-9681

cell: 215-863-1813

fax: 215-573-2175

<a class="moz-txt-link-abbreviated" href="mailto:strassel@ldc.upenn.edu">strassel@ldc.upenn.edu</a>

<a class="moz-txt-link-freetext" href="http://www.ldc.upenn.edu">http://www.ldc.upenn.edu</a>

</pre>

</body>

</html>