[Corpora-List] New website "Phrases in English"

William Fletcher fletcher at usna.edu
Thu Dec 11 16:37:04 UTC 2003


Apologies for cross-posting

A new website, "Phrases in English" (PIE), has been launched:  
  http://pie.usna.edu 
While still under development, PIE already offers much to both linguists and students, and additional features will increase its scope in the future.  

PIE incorporates a database of all 1-6-grams (phrases 1-6 "words" long) with part-of-speech (POS) codes occurring three or more times in the 100-million-word British National Corpus (BNC).  One can explore English phraseology either through lists of forms and their frequencies or by searching for specific forms or collocations, e.g. 2-grams of the pattern "ADJ work", to find the most frequent adjectives describing work.
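
To make the idea concrete, here is a minimal sketch of how n-grams of POS-tagged tokens might be counted and filtered by frequency. It is an illustration under stated assumptions, not PIE's actual implementation: the function names, the toy token list, and the simplified tags "ADJ" and "NOUN" are all hypothetical, and sentence boundaries are ignored.

```python
from collections import Counter

def count_ngrams(tokens, max_n=6, min_freq=3):
    """Count every 1..max_n-gram and keep those seen at least min_freq times."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

def adjectives_before(word, ngrams):
    """Return (adjective, frequency) pairs for 2-grams matching 'ADJ <word>'."""
    hits = [
        (gram[0][0], c)
        for gram, c in ngrams.items()
        if len(gram) == 2 and gram[0][1] == "ADJ" and gram[1][0] == word
    ]
    return sorted(hits, key=lambda pair: -pair[1])

# Toy data: (word, tag) pairs with invented, simplified tags (not BNC codes).
tokens = [("hard", "ADJ"), ("work", "NOUN")] * 3 + [("good", "ADJ"), ("work", "NOUN")]
frequent = count_ngrams(tokens)
result = adjectives_before("work", frequent)
```

With this toy input only "hard work" reaches the threshold of three occurrences, so "good work" is dropped, much as rare n-grams are excluded from the database described above.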

PIE also offers a phrase pattern discovery tool, "phrase-frames": sets of variants of an n-gram identical except for one word (wildcard symbol *). The most frequent and productive 4-frame is "the * of the", with variants such as "the end of the", "the rest of the", "the top of the", "the nature of the".
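
The grouping behind phrase-frames can be sketched as follows. This is only an illustration of the idea, not the generator used on the site: the function name is hypothetical, the counts are invented rather than taken from the BNC, and the sketch assumes the wildcard may fall in any position of the n-gram.

```python
from collections import defaultdict

def phrase_frames(ngram_counts):
    """Group n-grams into frames that differ in exactly one word slot."""
    frames = defaultdict(dict)
    for gram, count in ngram_counts.items():
        for i in range(len(gram)):
            # Replace the word at position i with the wildcard symbol.
            frame = gram[:i] + ("*",) + gram[i + 1:]
            frames[frame][gram[i]] = count
    return frames

# Invented counts for illustration only; these are not BNC frequencies.
toy_counts = {
    ("the", "end", "of", "the"): 50,
    ("the", "rest", "of", "the"): 40,
    ("the", "top", "of", "the"): 30,
    ("at", "the", "end", "of"): 20,
}
frames = phrase_frames(toy_counts)
variants = frames[("the", "*", "of", "the")]
frame_frequency = sum(variants.values())   # total occurrences of the frame
frame_productivity = len(variants)         # number of distinct fillers
```

Here the frame "the * of the" collects the fillers "end", "rest" and "top"; summing their counts gives the frame's frequency, and the number of distinct fillers gives one simple measure of its productivity.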

Over the next year PIE will add:

  -- Concordances from the BNC, shown by clicking an n-gram in the query results

  -- POS-grams and POS-frames for studying the relative productivity of phrase structures 

  -- Filtering by text type (domain, genre, target audience) for contrastive studies

  -- Query by regular expression (currently only wildcards are supported)

In addition, when POS-tagging of the Michigan Corpus of Academic Spoken English (MICASE) http://www.hti.umich.edu/micase/ is complete, a similar database will be created with those data.  Finally, when a substantial portion of the American National Corpus (ANC) http://americannationalcorpus.org has been released, a third parallel database will be built.  Together these databases will permit comparative studies of phraseology in the principal variants of English.

Please note:

  -- "Unfiltered" queries which match very large datasets can take several minutes to complete.  Please be patient; read the tutorials and FAQ to focus your queries.

  -- Users who cannot access the above site may use
      http://kwicfinder.com/BNC/  (please let me know so we can investigate)


Acknowledgements

Above all I am grateful to Michael Stubbs of the University of Trier for detailed suggestions and ongoing discussions that led to the creation and refinement of this site; even the acronym, as easy to remember as "easy as pie", goes back to him. His research assistants contributed as well:  Isabel Barth implemented the original phrase-frame generator, and Katrin Ungeheuer offered valuable comments on the organization and user interface for query by text type.  Finally, Lou Burnard of the BNC Consortium and David Lee of MICASE granted essential permissions and provided useful feedback on the site.

All user feedback will be received enthusiastically!

Bill Fletcher

fletcher AT usna.edu
fletcher AT kwicfinder.com

http://pie.usna.edu
http://kwicfinder.com


