[Corpora-List] Reducing n-gram output

Tue Oct 28 22:56:53 UTC 2008

Dahlmann Irina wrote:
> I was wondering whether anybody is aware of ideas and/or automated
> processes to reduce n-gram output by solving the common problem that
> shorter n-grams can be fragments of larger structures (e.g. the 5-gram
> 'at the end of the' as part of the 6-gram 'at the end of the day')
>
> I am only aware of Paul Rayson's work on c-grams (collapsed-grams).
>   
The c-gram approach only gives you a view of the larger n-grams that
contain a smaller n-gram of your choice. The real question is always
what your application or general goal is.
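
To make the fragment problem concrete, here is a minimal Python sketch
(my own illustration, not Paul Rayson's c-gram implementation). All
function names and the `ratio` threshold are assumptions made for the
example. It lists the larger n-grams that contain a chosen n-gram, and
drops every n-gram that is covered by a longer n-gram of (nearly) the
same frequency:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, min_n, max_n):
    """Count every n-gram with min_n <= n <= max_n."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous subsequence of `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter
               for i in range(len(longer) - k + 1))


def supergrams(counts, seed):
    """Larger n-grams that contain `seed`, most frequent first."""
    hits = [(g, c) for g, c in counts.items()
            if len(g) > len(seed) and contains(g, seed)]
    return sorted(hits, key=lambda gc: (-gc[1], len(gc[0]), gc[0]))


def drop_subsumed(counts, ratio=1.0):
    """Drop n-grams covered by a longer n-gram that is about as frequent.

    An n-gram g is dropped if some longer n-gram h contains it and
    count(h) >= ratio * count(g). With ratio=1.0 (an assumed default)
    g is dropped only if it never occurs outside of h.
    Quadratic in the number of n-gram types, so only for small lists.
    """
    kept = {}
    for g, c in counts.items():
        subsumed = any(len(h) > len(g) and c2 >= ratio * c and contains(h, g)
                       for h, c2 in counts.items())
        if not subsumed:
            kept[g] = c
    return kept


if __name__ == "__main__":
    text = "at the end of the day we met at the end of the day"
    counts = count_ngrams(text.split(), 5, 6)
    for gram, freq in sorted(drop_subsumed(counts).items()):
        print(freq, " ".join(gram))
```

In a real corpus 'at the end of the' will usually be more frequent than
'at the end of the day', so a ratio below 1.0 (or a proper association
measure) would be needed; which threshold is sensible depends on the
application.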

IMHO, this kind of view (plus statistical information and perhaps some
mechanism for introducing symbols) is exploited in many grammar
induction systems that use alignment, i.e. different strings occurring
in the same context, or one context with different strings occurring in
it. This goes back to the notion of substitutability in the
structuralist tradition (e.g. Zellig Harris), and it is found in
Alignment-Based Learning (e.g. van Zaanen) and, in some way, also in
the mentioned work on morphology induction (e.g. Goldsmith).
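
As a rough illustration of "different strings occurring in the same
context", the following Python sketch groups the word sequences found
between the same left and right neighbour. It is a deliberate
simplification and not the actual algorithm of Alignment-Based Learning
or of any other system mentioned here; the function name and the
`max_len` parameter are assumptions made for the example.

```python
from collections import defaultdict


def fillers_by_context(sentences, max_len=3):
    """Map (left word, right word) to the word sequences seen between them.

    Only contexts with at least two different fillers are returned,
    since those are the candidates for substitutable strings.
    `max_len` limits the filler length in words.
    """
    slots = defaultdict(set)
    for sent in sentences:
        for i in range(len(sent)):
            # j is the index of the right context word; the filler is
            # sent[i + 1:j] and has between 1 and max_len words.
            for j in range(i + 2, min(i + 2 + max_len, len(sent))):
                slots[(sent[i], sent[j])].add(tuple(sent[i + 1:j]))
    return {ctx: fillers for ctx, fillers in slots.items()
            if len(fillers) > 1}


if __name__ == "__main__":
    corpus = [
        "she saw the dog".split(),
        "she saw a big cat".split(),
        "she fed the dog".split(),
    ]
    for (left, right), fillers in sorted(fillers_by_context(corpus).items()):
        print(left, "_", right, sorted(" ".join(f) for f in fillers))
```

Real systems add frequency statistics and some way of turning such
filler sets into new symbols; the sketch stops at collecting the sets.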

For pure corpus analysis and visualization of n-gram relations, Paul
Rayson's c-grams might be the only relevant reference. Multigrams (used
in some CL tasks, e.g. language identification, LID) might be related
to this, at least from an applied perspective.

DC
