[Corpora-List] 'Standard European English' ? Web Corpus-building tools
William Fletcher
fletcher at usna.edu
Thu Mar 2 18:18:13 UTC 2006
Dear Eric, and others with similar projects,
For your students' "courseworks" (like Carmela I'd use a singular here, and probably say "course projects" instead) I have a suite of Windows tools that makes it easy to compile a "corpus" of Web pages in a few hours. While some of the tools have rough edges, I'd be glad to make them available to any understanding party who e-mails me (give me a day or two to make a rough guide to the process):
1. KWiCFinder (free from KWiCFinder.com), which conducts searches on specific words on "AltaVista" (now actually the Yahoo search engine) and downloads matching webpages. Searches can be restricted to a specific domain and language. The developmental version I use allows bypassing search report generation, so it runs much faster than the current release version. With broadband you can conduct 30-40 searches simultaneously and download several thousand matching pages an hour.
2. kfWinnow, which processes downloaded pages to eliminate duplicates and pages with very low or high word counts, which have low signal/noise ratio and high chance of repetition respectively (think I did that right).
3. kfNgram (ditto), which helps identify highly repetitive (long) documents (HRDs): look for multiple occurrences of 10- and 25-grams. It also prepares n-grams for step 4 (not essential, but valuable for studying Euro-English phraseology and comparing it to English, e.g. from my BNC-based Phrases in English site http://pie.usna.edu ).
4. kfNgramDB, which imports the output of steps 2 and 3 into a MySQL database for further study. It supplies default DB schemas and generates default queries and models more sophisticated queries for those willing to tinker a bit with SQL. It also downloads and imports datasets from PIE.
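For readers without the Windows tools, the filtering ideas in steps 2 and 3 can be roughly sketched in Python. This is not how kfWinnow or kfNgram are implemented; the function names, the word-count limits, and the 20% repetition threshold are all assumptions chosen for illustration, and duplicate detection here only catches pages whose word sequences are identical.

```python
import hashlib
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def winnow(pages, min_words=50, max_words=5000):
    """Drop exact duplicates and pages with too few or too many words.

    The limits are illustrative assumptions, not kfWinnow's defaults.
    """
    seen = set()
    kept = []
    for text in pages:
        tokens = words(text)
        if not (min_words <= len(tokens) <= max_words):
            continue
        # Hash the normalized token stream so whitespace/case changes
        # do not hide a duplicate.
        digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


def repeated_ngrams(text, n=10):
    """Return the n-grams that occur more than once, with their counts."""
    tokens = words(text)
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}


def is_highly_repetitive(text, n=10, max_repeated_share=0.2):
    """Flag a document when repeated n-grams make up a large share of it.

    The 0.2 threshold is an assumption; tune it against real pages.
    """
    total = len(words(text)) - n + 1
    if total <= 0:
        return False
    repeated = sum(repeated_ngrams(text, n).values())
    return repeated / total > max_repeated_share
```

In practice one would run winnow() over the downloaded pages first, then inspect whatever is_highly_repetitive() flags by hand before discarding it.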
Looking forward to seeing your students' results eventually, whatever tools they use!
Regards,
Bill Fletcher
PS My favorite Euro-English word is _beamer_ 'LCD data / video projector'.
>>> Eric Atwell <eric at comp.leeds.ac.uk> 03/02/06 6:00 AM >>>
My intuition is that, in addition to some "pan-european(except-UK)" English terms,
as suggested by Harold, there will be national variants of English with
local L1-inspired vocabulary and usages.
I have just set my final-year undergrad Computing class a coursework challenge,
"Finding English terms specific to a domain on the World Wide Web",
where "domain" here means a national top-level domain like .DE or .UK
- the 85 students in the class each have to study WWW-English in a different
country, and many have signed up for European nations.
So, I should have some answers for you after 24 March when the courseworks
have to be submitted!
Eric Atwell, School of Computing, Leeds University
PS CORPORA readers are welcome to send me advice or tips to pass on to
my students, esp on appropriate technologies they can use (so they
don't have to write the programs themselves!) - the coursework outline is
http://www.comp.leeds.ac.uk/eric/db32cw.doc
On Thu, 2 Mar 2006, Parveen Lallmamode wrote:
> Has anyone of you here ever heard of a 'Standard European English'? If yes:
>
> - What are its characteristics?
> - Which researcher added that 'English' to the World Englishes?
> - How does it differ from the 'Standard British English'?
> - Where can I read more about it?
>
> Thanking you all in advance.
--
Eric Atwell, Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
TEL: +44-113-2335430 FAX: +44-113-2335468 http://www.comp.leeds.ac.uk/eric