[Corpora-List] CL projects suitable for a high-schooler?

Fri Sep 21 01:56:53 UTC 2007

On September 18, I wrote:

  I'm having a conversation with a teacher looking for a CL project for
  a senior in high school.  Does anyone have experience with projects
  that would be suitable for a student at that level?  ("At that level"
  is pretty vague, but I think one could assume beginner-to-moderate
  programming skill, a good level of energy and incentive, no specific
  prior background at all in computational linguistics, and no LDC
  membership.)

The response has been wonderful.  (Ok, it wasn't 25 people, but it
felt that way!)  MANY thanks to:

  Khurshid Ahmad, Steven Bird, Alex Boulton, Eugene Charniak, Robert
  Dale, Steve Finch, Roeland Hancock, Rob Malouf, Chris Manning, Paul
  Johnston, Amruta Purandare, Raf Salkie, Diarmuid Ó Séaghdha, Harold
  Somers, Amanda Stent, Eric Yeh

I *think* that covers everybody; apologies if any messages slipped
through the cracks.

Below I'm going to try to summarize the replies within some useful
categories.

Cheers,

  Philip

----------------

Existing learning/teaching materials and references

- NLTK (nltk.sourceforge.net). Good source of code and project
  ideas, and it's also got a very nice collection of pre-processed
  corpus materials, including a sampler of some of the LDC's greatest
  hits.  See especially:
    o Nitin Madnani, Getting Started on Natural Language Processing with
      Python, ACM Crossroads Xrds13-4,
      http://www.acm.org/crossroads/xrds13-4/natural_language.html.
    o Electronic Grammar modules (used with high school students):
      writing programs to solve practical problems with words, texts
      and grammar. http://nltk.org/index.php/Electronic_Grammar.
    o The NLTK book, http://nltk.org/index.php/Book, which includes over
      200 graded exercises along with introductions to programming and
      NLP, some of which should be accessible to high school students.

- The Computational Linguistics Olympiad
  http://namclo.linguistlist.org/, in particular the sample problems,
  http://namclo.linguistlist.org/problems.cfm

- CSLU Toolkit, http://cslu.cse.ogi.edu/toolkit/.  A comprehensive
  suite of tools to enable exploration, learning, and research into
  speech and human-computer interaction.

- Ciezielska-Ciupek, M. 2001. Teaching with the internet and corpus
  materials: Preparation of the ELT materials using the internet and
  corpus resources. In Lewandowska-Tomaszczyk, B. (ed) PALC 2001:
  Practical Applications in Language Corpora. Lodz Studies in
  Language, 7. Frankfurt: Peter Lang, pp. 521-531.

- Sun, Y-C. & Wang, L-Y. 2003. Concordancers in the EFL classroom:
  Cognitive approaches and collocation difficulty. CALL, 16/1,
  pp. 83-94.

- Using corpora in L1: Paul Thompson at the University of Reading has
  worked with primary school children, and Julia Blake & Tim Shortis
  with secondary schools (cf. their paper at BAAL 2007).

Machine translation

- Implementing IBM Model 1

- Building a complete end-to-end statistical machine translation
  system, e.g. using MOSES (http://www.statmt.org/wmt07/baseline.html)

Supervised learning (e.g. using a Naive Bayes classifier)

- Word sense disambiguation

- Spam filtering (e.g. using spam message databases)

- Document classification (e.g. using the 20 Newsgroups corpus)

Unsupervised techniques

- Implementing language models using the SRILM toolkit

- Writing a bigram part-of-speech tagger, including Baum-Welch
  training and Viterbi search.

- Studying, critiquing and building a mini document ranking system
  based on PageRank.

- Odd one out: use simple similarity measures to pick the odd one out
  from a given set of words. E.g., in (Honda, Toyota, Sony,
  BMW, Mercedes), Sony is the odd word (not a car company). Or, in
  (India, China, Japan, Romania, Korea), Romania is the odd one (not
  an Asian country). The programming logic could be as simple as
  extracting features for each word and then selecting a word as the
  odd one if, after removing it from the set, the remaining members
  share the maximum number of features. Or, something more
  sophisticated could use a cosine similarity measure and pick the
  word with the lowest cosine similarity to the rest of the group.
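
Both strategies from the odd-one-out item above can be sketched in a few
lines of Python. This is only an illustration: the FEATURES table is
hypothetical, hand-built toy data (a real project would extract features
from a corpus), and "cosine with the rest of the group" is interpreted
here as the average cosine with each other word, which is an assumption.

```python
import math

# Hypothetical toy feature counts, made up for illustration. In a real
# project these might be counts of context words seen near each word.
FEATURES = {
    "Honda":    {"car": 5, "engine": 3, "company": 2},
    "Toyota":   {"car": 6, "engine": 2, "company": 3},
    "Sony":     {"television": 4, "music": 3, "company": 3},
    "BMW":      {"car": 4, "engine": 4, "company": 1},
    "Mercedes": {"car": 5, "engine": 2, "company": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def odd_one_out_shared(words, features):
    """Pick the word whose removal leaves the most features shared by all."""
    def shared_after_removing(word):
        rest = [set(features[w]) for w in words if w != word]
        return len(set.intersection(*rest))
    return max(words, key=shared_after_removing)


def odd_one_out_cosine(words, features):
    """Pick the word with the lowest average cosine to the other words."""
    def average_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(features[word], features[w]) for w in others)
        return total / len(others)
    return min(words, key=average_similarity)


words = list(FEATURES)
print(odd_one_out_shared(words, FEATURES))
print(odd_one_out_cosine(words, FEATURES))
```

With this toy data both functions select Sony. An interesting follow-up
for a student is to find word sets where the two strategies disagree.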

Corpus and grammar building/exploration

- Investigating some linguistic, sociolinguistic or stylistic aspect
  of the student's choice in blogs or constructing a Web corpus.
  [Reading LanguageLog, www.languagelog.org, would probably be a great
  start! -PSR]

- Building a small Web corpus and then doing collocation extraction or
  text classification. E.g. how do sports reports differ from music
  reviews, or tabloid journalism from broadsheet journalism, or
  Democrat authors from Republicans, or what do female bloggers write
  about more frequently than male bloggers?  [An exercise I wrote, at
  http://www.umiacs.umd.edu/~resnik/nlstat_tutorial_summer1998/Lab_ngrams.html,
  might be useful here. -PSR]

- Generating simple English sentences using a simple
  substitution-based grammar.  E.g. start by generating from a grammar like
  "(the|a(n)) (big|little|smelly|argumentative) (cat|dog|teacher)
  (ate|played with|jumped over|kicked|knew|typed on) (the|a(n))
  (lazy|silly|old|fluffy|dusty|horrible) (white|fat|....)
  (fox|school|telephone|keyboard)", and then represent some
  constraints as a filter over random replacements (i.e. if a random
  replacement creates a violation of a constraint, make a new random
  replacement).  For example, foxes aren't dusty, schools aren't lazy
  and can't be eaten, keyboards can't be known, etc.

- Evaluating either the grammar checker or the readability statistics
  that MS Word provides; then trying to design improvements, either as
  a specification for a better piece of software, or as a real program
  which does some things automatically that MS Word can't do.

- Spidering parallel texts that are published daily by the
  EU, and then exploring translations.

- Writing a KWIC concordancer in Python, to get the student used to
  manipulating lots of text.

- Using the Sketch Engine and associated corpora
  (http://www.sketchengine.co.uk/), e.g. to compare and contrast
  behaviour of "clever" vs. "intelligent" or "strong" vs. "powerful".

- Using http://corpus.byu.edu/ (formerly view.byu.edu) to do similar
  sorts of lexical explorations on material from the British National
  Corpus or Time Magazine corpus.

- Using the Linguist's Search Engine (lse.umiacs.umd.edu) to explore
  Web data by searching for syntactic structures.

- Writing or extending a grammar and evaluating its coverage

- Surveying different approaches to parsing and writing a simple
  definite clause grammar
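
The sentence-generation item in the list above (a substitution-based
grammar plus a filter over random replacements) might look roughly like
this in Python. It is a sketch under stated assumptions: the slot lists
are copied from the example grammar, the elided "(white|fat|....)" slot is
left with only its two given words, the constraint table encodes just the
four examples mentioned, and a violation is handled by redrawing the whole
sentence rather than a single word.

```python
import random

# One list of alternatives per slot, in sentence order.
SLOTS = [
    ["the", "a"],
    ["big", "little", "smelly", "argumentative"],
    ["cat", "dog", "teacher"],
    ["ate", "played with", "jumped over", "kicked", "knew", "typed on"],
    ["the", "a"],
    ["lazy", "silly", "old", "fluffy", "dusty", "horrible"],
    ["white", "fat"],  # the source list is elided; nothing is filled in
    ["fox", "school", "telephone", "keyboard"],
]

# Each rule is a set of (slot index, word) pairs that must not co-occur.
FORBIDDEN = [
    {(5, "dusty"), (7, "fox")},      # foxes aren't dusty
    {(5, "lazy"), (7, "school")},    # schools aren't lazy
    {(3, "ate"), (7, "school")},     # schools can't be eaten
    {(3, "knew"), (7, "keyboard")},  # keyboards can't be known
]


def violates(choice):
    """True if the chosen words break any constraint."""
    picked = set(enumerate(choice))
    return any(rule <= picked for rule in FORBIDDEN)


def fix_articles(words):
    """Turn 'a' into 'an' before a word starting with a vowel letter."""
    fixed = list(words)
    for i in range(len(fixed) - 1):
        if fixed[i] == "a" and fixed[i + 1][0] in "aeiou":
            fixed[i] = "an"
    return fixed


def generate(rng=random):
    """Draw random sentences until one passes the constraint filter."""
    while True:
        choice = [rng.choice(options) for options in SLOTS]
        if not violates(choice):
            return " ".join(fix_articles(choice))


for _ in range(3):
    print(generate())
```

Keeping the constraints in a data table rather than in code means a
student can add new ones without touching the generator itself.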

Other

- Code-breaker exercise: given a text message, such as "meet me in the
  park at 10", write a program that converts it into a cryptic coded
  message and a decoder that retrieves the original message. For
  example, one idea is an odd-even scheme: interleave the first half
  of the message with the second half, so that the first half lands
  on the odd positions and the second half on the even positions. This
  would generate the coded message "MEE_EPTA_RMKE__AITN__1T0H". To
  decipher this code, just read all the odd characters and then all
  the even characters (treating spaces as regular characters).
  Alternatives include block codes, character substitution, etc.
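
A minimal Python sketch of the odd-even scheme, written to reproduce the
example code string above: the encoder interleaves the first half of the
message with the second half, so that reading the odd positions and then
the even positions recovers the original. The function names are
illustrative, and showing spaces as underscores in upper case simply
follows the example.

```python
def encode(message):
    """Interleave the first half of the message with the second half."""
    half = (len(message) + 1) // 2  # first half gets the extra character
    first, second = message[:half], message[half:]
    out = []
    for i, ch in enumerate(first):
        out.append(ch)
        if i < len(second):
            out.append(second[i])
    return "".join(out)


def decode(code):
    """Read the odd positions (1st, 3rd, ...), then the even positions."""
    return code[0::2] + code[1::2]


plain = "meet me in the park at 10".upper().replace(" ", "_")
secret = encode(plain)
print(secret)          # MEE_EPTA_RMKE__AITN__1T0H
print(decode(secret))  # MEET_ME_IN_THE_PARK_AT_10
```

Because decoding is just two slices, this makes a good first cipher
before moving on to block codes or character substitution.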

Other corpus suggestions

- Project Gutenberg
- Reuters RCV1 news corpus
- Enron e-mail corpus
- Wikipedia (downloadable as an XML file)
- Europarl parallel translations
  (http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl/)
- Parallel Bibles and Web page translations
  (http://www.umiacs.umd.edu/~resnik/parallel/)