[Corpora-List] seeking knorpora advice

Sat Oct 11 22:56:27 UTC 2003

Dear all,

I am in the process of creating a modified version of the Knoppix live
CD for computational/corpus-based linguistics students.

As the Knoppix page (www.knoppix.net) says, "KNOPPIX is a bootable CD
with a collection of GNU/Linux software, automatic hardware detection,
and support for many graphics cards, sound cards, SCSI and USB devices
and other peripherals. KNOPPIX can be used as a Linux demo, educational
CD, rescue system, or adapted and used as a platform for commercial
software product demos. It is not necessary to install anything on a
hard disk. Due to on-the-fly decompression, the CD can have up to 2 GB
of executable software installed on it."

Knoppix can be extremely useful for people who want to test or learn
Linux but do not want to, or cannot, install it.

The modified version I am preparing will have a set of tools and data
specifically geared towards computational/corpus-based linguists who
want to try Linux. I will make the ISO image available on my site.

What I would like to ask you is:

- what kinds of programs/tools would you recommend for the CD (of
course, they must compile on Linux)?
- what kinds of data (corpora, word lists...) would you include on the
CD (I am particularly interested in freely distributable corpora)?

I am looking for things that are released under the GPL or a similar
license, so that I will not have problems putting them on the CD.

In general, I would prefer easy-to-use, not-too-specialized programs
that work on the command line.

Some things I am planning to include:

- N-gram Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html)
- K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
- WordNet (http://www.cogsci.princeton.edu/~wn/)
- ACOPOST collection of POS taggers (http://acopost.sourceforge.net/)
- bow toolkit (http://www-2.cs.cmu.edu/~mccallum/bow/)
- parts of the OPUS corpus (http://logos.uio.no/opus/)
- various Perl modules that are useful for corpus/NLP work
- my own scripts for term extraction and for downloading corpora from
the web

What else?

Of course, if somebody has already done something like this, I would be
very curious to hear about it.

If you are receiving this through the Corpora list, please reply to me
directly. If there is interest, I will post a summary of the replies to
the list. And I will definitely let you know when the CD is ready.

Thanks in advance!

Regards,

Marco

---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni