[Corpora-List] ANC Bigrams and Trigrams

Nicolas Hernandez nicolas.hernandez at gmail.com
Mon Feb 14 13:16:37 UTC 2005


On Fri, 11 Feb 2005 14:42:18 -0500, Nancy Ide <ide at cs.vassar.edu> wrote:
> We are generating bigram and trigram data from the ANC First Release,
> which will very soon be available on the (new and improved) ANC
> website. We have a question for those who might be interested in this
> kind of data:  is it useful to generate the data for word pairs/triples
> that span sentence (or even paragraph) boundaries? Is there any
> advantage if we provide two sets of the bigram and trigram data, one
> that spans such boundaries and one that doesn't?

Dear Nancy,

Personally, I have used n-grams to extract "meta-discourse expressions"
(basically, frequent n-grams occurring in a corpus of a specific
genre). I was interested in punctuation marks because they could give
me contextual cues that could be used to select those expressions.
For example:
"in this section" could have a different discourse interpretation at
the start of a sentence (". In this section") than at the end of one
("in this section ."), depending on the text genre.

In my opinion, having such n-grams (with the punctuation marks kept)
makes the statistical measures more accurate.
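
To make the idea concrete, here is a minimal sketch in Python of counting
n-grams with sentence-final punctuation kept as tokens. The tokenizer, the
function names and the sample text are my own assumptions for illustration;
this is not the procedure used to build the ANC data.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep ".", "!" and "?"
    # as separate tokens so that n-grams can record sentence boundaries.
    return re.findall(r"[a-z0-9]+|[.!?]", text.lower())


def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens, including runs that cross
    # a sentence boundary.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, for illustration only.
text = ("We describe the corpus. In this section we present the method. "
        "The results are given in this section.")

tokens = tokenize(text)
trigrams = ngram_counts(tokens, 3)
fourgrams = ngram_counts(tokens, 4)

plain = trigrams[("in", "this", "section")]
at_start = fourgrams[(".", "in", "this", "section")]
at_end = fourgrams[("in", "this", "section", ".")]
```

On this sample, the plain trigram occurs twice, while the sentence-initial
and sentence-final variants occur once each, so the punctuation token is
what separates the two uses.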

/Nicolas

>
> Thanks,
> Nancy Ide
>
> =======================================================
>
> Nancy Ide
>
> Professor  of Computer Science
> Vassar College
> Poughkeepsie, NY 12604-0520 USA
> Tel: +1 845 437-5988 Fax: +1 845 437-7498
> ide at cs.vassar.edu
>
> Chercheur Associe
> Equipe Langue et Dialogue, LORIA/CNRS
> Campus Scientifique - BP 239
> 54506 Vandoeuvre-les-Nancy FRANCE
> Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
> ide at loria.fr
>
> =======================================================
>
>


--
Nicolas Hernandez
LIR - LIMSI
BP 133, 91403 Orsay Cedex
tel. 01 69 85 80 03, fax 01 69 85 80 88
IIE - CNAM
tel. 01 69 36 73 48


