Corpora: corpus of Information Tech.

Kim Tan kimmy1003 at hotmail.com
Thu Nov 23 02:35:45 UTC 2000


Hi all,

I'm currently involved in a small project on terminology building for
Information Technology (IT) from two content-parallel corpora, i.e. English
and Malay. The texts are not translations of each other but weekly newspaper
pullouts on IT, both dealing with the same subject area.

At this stage, we're still handling the English articles and trying to
identify IT-specific words. From a corpus of nearly 400,000 words, a
wordlist has been generated based on frequency counts. This list is compared
with the wordlist of a general corpus of Malaysian English (ME) of 300,000
words. The ME frequencies are adjusted first, and a frequency index is then
calculated for each word. Words whose index shows them to be over-represented
in the specialized corpus compared with the general corpus are taken to be
IT-specific words.
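
To make the comparison concrete, here is a minimal sketch of one way such
an index could be computed. It assumes that "adjusting" means normalizing
each count by its corpus size and that the index is the ratio of the two
relative frequencies; that may not be exactly the procedure used in the
project. The function name, the smoothing value and the word counts are all
made up for illustration.

```python
from collections import Counter


def frequency_index(special_counts, general_counts,
                    special_size, general_size, smoothing=0.5):
    """Ratio of relative frequencies (specialized / general) per word.

    The smoothing constant is an assumption: it is added to the general
    count so that words absent from the general corpus do not cause a
    division by zero.
    """
    index = {}
    for word, count in special_counts.items():
        special_rate = count / special_size
        general_rate = (general_counts.get(word, 0) + smoothing) / general_size
        index[word] = special_rate / general_rate
    return index


# Hypothetical counts, not taken from the real corpora.
special = Counter({"the": 20000, "server": 800, "software": 600})
general = Counter({"the": 15000, "server": 3, "software": 30})

index = frequency_index(special, general,
                        special_size=400_000, general_size=300_000)

# Words with the highest index are the candidates for IT-specific terms.
for word in sorted(index, key=index.get, reverse=True):
    print(f"{word:10s} {index[word]:8.2f}")
```

With these toy counts a function word such as "the" gets an index close to 1,
while "server" and "software" come out well above 1. Where to put the cut-off
between IT-specific and general words is a separate decision that the sketch
does not address.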

My question is whether this would be a valid claim, and also whether there are
other ways (statistical or otherwise) of identifying words that are
specific to a specialized area. As I'm rather new to this area, I'd
appreciate any form of input.

Seeking your expertise

KIM
National Univ. of Malaysia


