[Corpora-List] ANC Bigrams and Trigrams

Nicolas Hernandez nicolas.hernandez at gmail.com
Mon Feb 14 13:16:37 UTC 2005

On Fri, 11 Feb 2005 14:42:18 -0500, Nancy Ide <ide at cs.vassar.edu> wrote:
> We are generating bigram and trigram data from the ANC First Release,
> which will very soon be available on the (new and improved) ANC
> website. We have a question for those who might be interested in this
> kind of data:  is it useful to generate the data for word pairs/triples
> that span sentence (or even paragraph) boundaries? Is there any
> advantage if we provide two sets of the bigram and trigram data, one
> that spans such boundaries and one that doesn't?

Dear Nancy,

Personally, I have used n-grams to extract "meta-discourse expressions"
(basically, frequent n-grams occurring in a corpus of a specific
genre). I was interested in punctuation marks because they give
contextual indications that can be used to select such expressions.
For example:
"in this section" could have a different discourse interpretation at
the start (". In this section") and at the end of a sentence ("in this
section .") (depending on text genre).

In my opinion, having such n-grams yields more accurate statistical
measures.
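
To illustrate the point, here is a minimal Python sketch. It assumes a
toy corpus that is already tokenized, with punctuation kept as separate
tokens. The function names (ngrams, count_ngrams) and the sample
sentences are hypothetical, chosen only for this example; this is not
how the ANC data is generated.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(sentences, n, span_boundaries=True):
    """Count n-grams over a list of tokenized sentences.

    With span_boundaries=True the sentences are concatenated first, so
    n-grams may cross a sentence boundary; otherwise each sentence is
    counted on its own.
    """
    counts = Counter()
    if span_boundaries:
        flat = [tok for sent in sentences for tok in sent]
        counts.update(ngrams(flat, n))
    else:
        for sent in sentences:
            counts.update(ngrams(sent, n))
    return counts

# Toy corpus (an assumption for illustration): lowercased tokens,
# with the sentence-final period kept as a token.
sentences = [
    ["we", "review", "related", "work", "."],
    ["in", "this", "section", "we", "define", "terms", "."],
    ["results", "are", "given", "in", "this", "section", "."],
]

spanning = count_ngrams(sentences, 4, span_boundaries=True)
within = count_ngrams(sentences, 4, span_boundaries=False)

sentence_initial = (".", "in", "this", "section")
sentence_final = ("in", "this", "section", ".")
print(spanning[sentence_initial], within[sentence_initial])
print(spanning[sentence_final], within[sentence_final])
```

Only the counts that span boundaries contain the sentence-initial
pattern, since the period belongs to the preceding sentence. The
sentence-final pattern appears in both sets as long as punctuation is
kept as a token.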


> Thanks,
> Nancy Ide
> =======================================================
> Nancy Ide
> Professor  of Computer Science
> Vassar College
> Poughkeepsie, NY 12604-0520 USA
> Tel: +1 845 437-5988 Fax: +1 845 437-7498
> ide at cs.vassar.edu
> Chercheur Associe
> Equipe Langue et Dialogue, LORIA/CNRS
> Campus Scientifique - BP 239
> 54506 Vandoeuvre-les-Nancy FRANCE
> Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
> ide at loria.fr
> =======================================================

Nicolas Hernandez
BP 133, 91403 Orsay Cedex
tel. 01 69 85 80 03, fax 01 69 85 80 88
tel. 01 69 36 73 48
