On September 18, I wrote:

  I'm having a conversation with a teacher looking for a CL project for
  a senior in high school. Does anyone have experience with projects
  that would be suitable for a student at that level? ("At that level"
  is pretty vague, but I think one could assume beginner-to-moderate
  programming skill, a good level of energy and incentive, no specific
  prior background at all in computational linguistics, and no LDC
  membership.)
<br><br>The response has been wonderful. (Ok, it wasn't 25 people, but it<br>felt that way!) MANY thanks to:<br><br> Khurshid Ahmad, Steven Bird, Alex Boulton, Eugene Charniak, Robert<br> Dale, Steve Finch, Roeland Hancock, Rob Malouf, Chris Manning, Paul
<br> Johnston, Amruta Purandare, Raf Salkie, Diarmuid Ó Séaghdha, Harold<br> Somers, Amanda Stent, Eric Yeh<br><br>I *think* that covers everybody; apologies if any messages slipped<br>through the cracks.<br><br>Below I'm going to try to summarize the replies within some useful
<br>categories.<br><br>Cheers,<br><br> Philip<br><br>----------------<br><br><br>Existing learning/teaching materials and references<br><br>- NLTK (<a href="http://nltk.sourceforge.net">nltk.sourceforge.net</a>). Good source of code and project
<br> ideas, and it's also got a very nice collection of pre-processed<br> corpus materials, including a sampler of some of the LDC's greatest<br> hits. See especially:<br> o Nitin Madnani, Getting Started on Natural Language Processing with
<br> Python, ACM Crossroads Xrds13-4,<br> <a href="http://www.acm.org/crossroads/xrds13-4/natural_language.html">http://www.acm.org/crossroads/xrds13-4/natural_language.html</a>.<br> o Electronic Grammar modules (used with high school students):
<br> writing programs to solve practical problems with words, texts<br> and grammar. <a href="http://nltk.org/index.php/Electronic_Grammar">http://nltk.org/index.php/Electronic_Grammar</a>.<br> o The NLTK book,
<a href="http://nltk.org/index.php/Book">http://nltk.org/index.php/Book</a>, which includes over<br> 200 graded exercises along with introductions to programming and<br> NLP, some of which should be accessible to high school students.
<br><br>- The Computational Linguistics Olympiad<br> <a href="http://namclo.linguistlist.org/">http://namclo.linguistlist.org/</a>, in particular the sample problems,<br> <a href="http://namclo.linguistlist.org/problems.cfm">
http://namclo.linguistlist.org/problems.cfm</a><br><br>- CSLU Toolkit, <a href="http://cslu.cse.ogi.edu/toolkit/">http://cslu.cse.ogi.edu/toolkit/</a>. A comprehensive<br> suite of tools to enable exploration, learning, and research into
<br> speech and human-computer interaction.<br><br>- Ciezielska-Ciupek, M. 2001. Teaching with the internet and corpus<br> materials: Preparation of the ELT materials using the internet and<br> corpus resources. In Lewandowska-Tomaszczyk, B. (ed) PALC 2001:
<br> Practical Applications in Language Corpora. Lodz Studies in<br> Language, 7. Frankfurt: Peter Lang, p.521-531.<br><br>- Sun, Y-C. & Wang, L-Y. 2003. Concordancers in the EFL classroom:<br> Cognitive approaches and collocation difficulty. CALL, 16/1,
<br> p. 83-94.<br><br>- Using corpora in L1, Paul Thompson at the University of Reading has<br> worked with primary school children; Julia Blake & Tim Shortis in<br> secondary schools (cf their paper at BAAL 2007).
<br><br><br>Machine translation<br><br>- Implementing IBM Model 1<br><br>- Building a complete end-to-end statistical machine translation<br> system, e.g. using MOSES (<a href="http://www.statmt.org/wmt07/baseline.html">


Supervised learning (e.g. using a Naive Bayes classifier)

- Word sense disambiguation

- Spam filtering (e.g. using spam message databases)

- Document classification (e.g. using the 20 Newsgroups corpus); see
  the sketch after this list
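
As a concrete starting point for the classification ideas above, here
is a bag-of-words Naive Bayes sketch using NLTK (mentioned under the
learning materials). The four training sentences and the two labels
are invented; a real project would substitute the 20 Newsgroups
corpus, a spam archive, or sense-tagged data.

    # Bag-of-words Naive Bayes text classification with NLTK.
    import nltk

    # Invented toy training data; swap in real documents and labels.
    train = [
        ("the striker scored a late goal", "sports"),
        ("the match ended in a penalty shootout", "sports"),
        ("the band released a new album", "music"),
        ("the singer performed an acoustic set", "music"),
    ]

    def features(text):
        # Each word that occurs in the text becomes a boolean feature.
        return {word: True for word in text.lower().split()}

    train_set = [(features(text), label) for text, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(features("a goal in the final match")))  # expected: sports
    print(classifier.classify(features("an album by the band")))       # expected: music
    classifier.show_most_informative_features(5)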


Unsupervised techniques

- Implementing language models using the SRI LM toolkit

- Writing a bigram part-of-speech tagger, including Baum-Welch
  training and Viterbi search.

- Studying, critiquing and building a mini document ranking system
  based on PageRank.

- Odd one out: use simple similarity measures to pick the odd-one-out
  from a given set of words. E.g., in (Honda, Toyota, Sony, BMW,
  Mercedes), Sony is the odd word (not a car company). Or, in (India,
  China, Japan, Romania, Korea), Romania is the odd one (not an Asian
  country). The programming logic could be as simple as extracting
  features for each word and picking as "odd" the word whose removal
  leaves the remaining members sharing the maximum number of features.
  Or, something more sophisticated: use a cosine similarity measure
  and pick as odd the word with the lowest cosine to the rest of the
  group. (See the sketch below.)
<br> [Reading LanguageLog, <a href="http://www.languagelog.org">www.languagelog.org</a>, would probably be a great<br> start! -PSR]<br><br>- Building a small Web corpus and then doing collocation extraction or<br> text classification.
E.g. how do sports reports differ from music<br> reviews, or tabloid journalism from broadsheet journalism, or<br> Democrat authors from Republicans, or what do female bloggers write<br> about more frequently than male bloggers? [An exercise I wrote, at
<br> <a href="http://www.umiacs.umd.edu/~resnik/nlstat_tutorial_summer1998/Lab_ngrams.html">http://www.umiacs.umd.edu/~resnik/nlstat_tutorial_summer1998/Lab_ngrams.html</a>,<br> might be useful here. -PSR]<br><br>- Generating simple English sentences using a simple substitution
<br> based grammar. E.g. start by generating from a grammar like<br> "(the|a(n)) (big|little|smelly|argumentative) (cat|dog|teacher)<br> (ate|played with|jumped over|kicked|knew|typed on) (the|a(n))<br> (lazy|silly|old|fluffy|dusty|horrible) (white|fat|....)
<br> (fox|school|telephone|keyboard)", and then represent some<br> constraints as a filter over random replacements (i.e. if a random<br> replacement creates a violation of a constraint, make a new random<br> replacement). For example, foxes aren't dusty, schools aren't lazy
<br> and can't be eaten, keyboards can't be known, etc.<br><br>- Evaluating either the grammar checker or the readability statistics<br> that MS Word provides; then trying to design improvements, either as<br> a specification for a better piece of software, or as a real program
<br> which does some things automatically that MS Word can't do.<br><br>- Spidering parallel texts that are generated daily from the<br> EU, and then exploring translations.<br><br>- Writing a KWIC concordancer in python, to get them used to
<br> manipulating lots of text.<br><br>- Using the Sketch Engine and associated corpora<br> (<a href="http://www.sketchengine.co.uk/">http://www.sketchengine.co.uk/</a>), e.g. to compare and contrast<br> behaviour of "clever" vs. "intelligent" or "strong" vs. "powerful".
<br><br>- Using <a href="http://corpus.byu.edu/">http://corpus.byu.edu/</a> (formerly <a href="http://view.byu.edu">view.byu.edu</a>) to do similar<br> sorts of lexical explorations on material from the British National<br>
Corpus or Time Magazine corpus.<br><br>- Using the Linguist's Search Engine (<a href="http://lse.umiacs.umd.edu">lse.umiacs.umd.edu</a>) to explore<br> Web data by searching for syntactic structures.<br><br>- Writing or extending a grammar and evaluating its coverage
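
Two small sketches for items in the list above. First, the
substitution-based generation idea: a word is drawn at random for each
slot, and the sentence is simply redrawn whenever it violates a
constraint. The slot fillers and the banned combinations are invented
examples, not a serious grammar, and the a/an alternation is ignored
for simplicity.

    # Substitution-based sentence generation with a constraint filter.
    import random

    slots = [
        ["the", "a"],
        ["big", "little", "smelly", "argumentative"],
        ["cat", "dog", "teacher"],
        ["ate", "played with", "jumped over", "kicked", "knew", "typed on"],
        ["the", "a"],
        ["lazy", "silly", "old", "fluffy", "dusty", "horrible"],
        ["white", "fat"],
        ["fox", "school", "telephone", "keyboard"],
    ]

    # Constraints as banned (adjective, noun) and (verb, noun) combinations.
    bad_adj_noun = {("dusty", "fox"), ("lazy", "school")}
    bad_verb_noun = {("ate", "school"), ("knew", "keyboard")}

    def generate():
        while True:   # redraw until no constraint is violated
            det1, adj1, n1, verb, det2, adj2, adj3, n2 = (
                random.choice(options) for options in slots)
            if (adj2, n2) in bad_adj_noun or (adj3, n2) in bad_adj_noun:
                continue
            if (verb, n2) in bad_verb_noun:
                continue
            return " ".join([det1, adj1, n1, verb, det2, adj2, adj3, n2])

    for _ in range(3):
        print(generate())

Second, a bare-bones keyword-in-context (KWIC) concordancer over
whitespace-tokenised text; a fuller version would handle punctuation
properly and read whole files.

    # Minimal KWIC concordancer: print each hit with a window of context.
    def kwic(text, keyword, width=4):
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token.lower().strip('.,!?;:"') == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                print(f"{left:>30} | {token:^10} | {right}")

    sample = ("The dog chased the cat. The cat climbed a tree while "
              "the dog barked at the tree.")
    kwic(sample, "dog")
    kwic(sample, "cat")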
<br><br>- Surveying different approaches to parsing and writing a simple<br> definite clause grammar<br><br><br>Other<br><br>- Code-breaker exercise: given a text message, such as "meet me in the<br> park at 10", write a program that converts it into a cryptic code
<br> messege and a decoder that retrieves the original messege back. For<br> example, one idea is to use the odd-even scheme and display all the<br> odd characters first, followed by the even characters. This would<br>
generate a code messege: "MEE_EPTA_RMKE__AITN__1T0H". To decipher<br> this code, just read all the odd characters and then all the even<br> characters (treating spaces as regular characters). Alternatives,<br>
e.g. block code, character substitution, etc.<br><br><br>Other corpus suggestions<br><br>- Project Gutenberg <br>- Reuters RCV1 news corpus<br>- Enron e-mail corpus<br>- Wikipedia (downloadable as an XML file)<br>- Europarl parallel translations (
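
A sketch of the odd-even scheme just described is below. Upper-casing
the text and showing spaces as underscores match the example message;
both choices are purely cosmetic.

    # Odd-even transposition: the first half of the message fills the odd
    # positions of the code, the second half fills the even positions.
    def encode(message):
        text = message.upper().replace(" ", "_")
        half = (len(text) + 1) // 2        # odd positions hold the first half
        first, second = text[:half], text[half:]
        return "".join(first[i // 2] if i % 2 == 0 else second[i // 2]
                       for i in range(len(text)))

    def decode(code):
        # Read the odd positions, then the even positions.
        return (code[0::2] + code[1::2]).replace("_", " ").lower()

    code = encode("meet me in the park at 10")
    print(code)           # MEE_EPTA_RMKE__AITN__1T0H
    print(decode(code))   # meet me in the park at 10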
<a href="http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl/">http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl/</a>)<br>- Parallel Bibles and Web page translations (<a href="http://www.umiacs.umd.edu/~resnik/parallel/">
http://www.umiacs.umd.edu/~resnik/parallel/</a>)<br><br>