[Corpora-List] "Phrases in English" database -- new features
William Fletcher
fletcher at usna.edu
Sun Mar 28 22:41:37 UTC 2004
Apologies for cross-posting
Since its launch in December 2003, the "Phrases in English" (PIE) website has gained several new features (see below for general information):
http://pie.usna.edu
-- "Explore POS-Grams" supports investigating Part Of Speech patterns by frequencies of Types or Tokens.
-- "Simple Search" for n-grams offers a focused user interface to reduce errors. Special features include:
- automatic checking and correction of multi-word units (of course > of_course; don't > do n't)
- "optional wildwords" for fuzzy searches (_the +{AJ?} ~{AJ?} days_ matches both _the good old days_ and _the good days_)
- "tamecard" search for hyphenated forms matches variants with a space and/or nothing (_data-base_ also matches _data base_ and _database_).
-- Click on any n-gram to see 50 concordances from the BNC, with information on source texts.
-- "Chargram" explores sequences of n characters, where n falls in the range 1-3. Occurrences of letter sequences can be explored either by position (initial, medial, final) or by frequency in types or tokens.
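To illustrate the "tamecard" idea mentioned above, here is a minimal Python sketch of how a hyphenated query might be expanded into its variants. This is not PIE's actual implementation; the function name `tamecard_variants` is hypothetical, and the sketch only assumes that each hyphen may stand for a hyphen, a space, or nothing.

```python
def tamecard_variants(term: str) -> list[str]:
    """Expand every hyphen in `term` into a hyphen, a space, or nothing.

    Hypothetical helper for illustration only; PIE's real query engine
    may work quite differently.
    """
    parts = term.split("-")
    variants = [parts[0]]
    for part in parts[1:]:
        # Each existing variant branches three ways at this hyphen.
        variants = [v + sep + part for v in variants for sep in ("-", " ", "")]
    return variants


print(tamecard_variants("data-base"))
# ['data-base', 'data base', 'database']
```

A term with two hyphens would yield nine variants, so a real system would presumably match against an index rather than enumerate long expansions.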
Various improvements have resulted directly from user suggestions. All feedback on these and other features will be received enthusiastically!
- - - - - - - - - - - - - - - - - - - - - - -
PIE incorporates a database of all 1-6-grams (phrases 1-6 "words" long) with part-of-speech (POS) codes occurring three or more times in the 100-million-word British National Corpus (BNC). One can explore English phraseology either through lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work", to find the most frequent adjectives describing _work_.
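For readers unfamiliar with n-gram databases, the following Python sketch shows roughly what "all 1-6-grams occurring three or more times" means. It is an illustration under simplifying assumptions, not the procedure used to build PIE: it ignores POS codes and sentence boundaries, and the function name `ngram_counts` and the toy token list are made up for this example.

```python
from collections import Counter


def ngram_counts(tokens, max_n=6, min_freq=3):
    """Count all 1..max_n-grams and keep those seen at least min_freq times.

    Simplified sketch: plain word forms only, no POS codes, no sentence
    boundaries. Defaults mirror the figures quoted above (1-6 words,
    three or more occurrences).
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy input, invented for illustration (not BNC data).
tokens = ("the end of the day and the end of the week "
          "and the end of the year").split()
frequent = ngram_counts(tokens)
print(frequent[("the", "end", "of", "the")])  # 3
```

In this toy text the 1-gram "and" occurs only twice, so it falls below the threshold and is dropped.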
PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram identical except for one word (wildcard symbol *). The most frequent and productive 4-frame is "the * of the", with variants such as "the end of the", "the rest of the", "the top of the", "the nature of the".
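The phrase-frame notion can be sketched in a few lines of Python. This is only a rough illustration of the grouping idea, not the generator used on the site; the function name `phrase_frames` is hypothetical, and the frequencies in the sample data are invented placeholders rather than BNC counts.

```python
from collections import defaultdict


def phrase_frames(ngram_freqs):
    """Group n-grams into frames that differ in exactly one word slot.

    Returns a mapping: frame (tuple containing one "*") -> {filler: freq}.
    Hypothetical sketch for illustration only.
    """
    frames = defaultdict(dict)
    for gram, freq in ngram_freqs.items():
        for i in range(len(gram)):
            frame = gram[:i] + ("*",) + gram[i + 1:]
            frames[frame][gram[i]] = freq
    return frames


# Placeholder frequencies, invented for this example.
sample = {
    ("the", "end", "of", "the"): 40,
    ("the", "rest", "of", "the"): 30,
    ("the", "top", "of", "the"): 20,
    ("the", "nature", "of", "the"): 10,
}
frames = phrase_frames(sample)
print(sorted(frames[("the", "*", "of", "the")]))
# ['end', 'nature', 'rest', 'top']
```

Ranking frames by the number of distinct fillers or by the summed frequency of their variants would be one way to approximate "most frequent and productive".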
Over the next year PIE will add:
-- Filtering by text type (domain, genre, target audience) for contrastive studies
-- Query by regular expression (currently only wildcards are supported)
In addition, when POS-tagging of the Michigan Corpus of Academic Spoken English (MICASE) http://www.hti.umich.edu/micase/ is complete, a similar database will be created with those data. Finally, when a substantial portion of the American National Corpus (ANC) http://americannationalcorpus.org has been released, a third parallel database will be built. Together these databases will permit comparative studies of phraseology in the principal variants of English.
Please note:
-- "Unfiltered" queries which match very large datasets can take a couple of minutes to complete. Please be patient; read the tutorials and FAQ to focus your queries.
-- Users who cannot access the above site may use
http://kwicfinder.com/BNC/ (please let me know so we can investigate)
Acknowledgements
Above all I am grateful to Michael Stubbs of the University of Trier for detailed suggestions and ongoing discussions that led to the creation and refinement of this site; even the easy-to-remember ("easy as pie") acronym goes back to him. His research assistants contributed as well: Isabel Barth implemented the original phrase-frame generator and Katrin Ungeheuer offered valuable comments on organization and user interface for query by text type. Finally, Lou Burnard of the BNC Consortium and David Lee of MICASE granted essential permissions and provided useful feedback on the site.
Bill Fletcher
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Sending an attachment? See below.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
AssocProf William H. Fletcher
Language Studies Department
United States Naval Academy
Annapolis MD 21402 5030
410-293-6362 [voice]
410-293-2729 [fax]
Department
http://usna.edu/LangStudy/
Phrases in English
http://pie.usna.edu/
KWiCFinder
http://kwicfinder.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Don't worry about other people
stealing your ideas. If your ideas
are any good, you'll have to ram
them down people's throats.
--Howard Aiken
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Our mail server deletes messages with
certain kinds of attachments without
notifying the sender or recipient.
If sending a .doc, .exe or .zip file, please
rename it to delete the extension before
sending and let me know in the body
of the message what kind of file it is.