[Corpora-List] Phrase extraction

Diana Maynard d.maynard at dcs.shef.ac.uk
Tue Oct 25 10:03:59 UTC 2005


Hi Helge
I am sure there are some Norwegian taggers out there somewhere, but I don't
know if any of them are free.

If you don't have a suitable training corpus and don't want to create one
manually, then, depending on how ambiguous the language is with respect to
POS and how accurate you need your results to be, you might be able to
generate a rough-and-ready POS tagger using just a monolingual (or bilingual)
online Norwegian dictionary and a tagger such as the Brill tagger. I've done
this for various languages by simply replacing the tagger's lexicon with a
lexicon of the target language (using a few scripts to reformat it to match
the Brill format) and using the default ruleset for the language closest to
the target (in terms of family and behaviour). Then just run the tagger as
usual on your corpus. You won't get perfect results, but you might get
something good enough for your purposes, depending on what you ultimately
want to do.
I've generated a Hindi tagger with around 70% accuracy in this way (using GATE 
and the Hepple tagger, which is like the Brill tagger) with nothing more than 
a basic Hindi-English bilingual dictionary. I've done the same for Western 
languages and got much better results.
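
As a rough illustration of the lexicon-reformatting step, here is a minimal
Python sketch. It rests on several assumptions that are not part of the
description above: the dictionary has been exported as tab-separated
"headword<TAB>part-of-speech label" lines, the labels in POS_MAP are
hypothetical, the target tags are Penn Treebank style, and the tagger's
lexicon uses one "word TAG1 TAG2 ..." entry per line with the preferred tag
first. Check the lexicon file shipped with your tagger before relying on that
layout.

```python
# Sketch: turn dictionary entries into Brill-style lexicon lines.
# POS_MAP and the input layout are assumptions made for illustration only.

# Hypothetical mapping from dictionary POS labels to Penn Treebank style tags.
POS_MAP = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}


def parse_dictionary(lines):
    """Yield (headword, pos_label) pairs from 'headword<TAB>label' lines."""
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        word, label = line.split("\t", 1)
        yield word.strip(), label.strip()


def build_lexicon(entries, pos_map=POS_MAP):
    """Collect the tags seen for each word, in order of first appearance."""
    lexicon = {}
    for word, label in entries:
        tag = pos_map.get(label.lower())
        if tag is None:
            continue  # label has no mapping, so the entry is dropped
        tags = lexicon.setdefault(word, [])
        if tag not in tags:
            tags.append(tag)
    return lexicon


def format_lexicon(lexicon):
    """Return one 'word TAG1 TAG2 ...' line per word, sorted by word."""
    return [" ".join([word] + tags) for word, tags in sorted(lexicon.items())]


if __name__ == "__main__":
    sample = ["hus\tnoun", "løpe\tverb", "lys\tnoun", "lys\tadj"]
    for entry in format_lexicon(build_lexicon(parse_dictionary(sample))):
        print(entry)
```

Because a dictionary carries no frequency information, the order of the tags
for an ambiguous word here is simply the order in which the dictionary lists
them, which is one reason the resulting tagger is only rough.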

See http://www.dcs.shef.ac.uk/~diana/publications.html
for a paper which discusses using this technique to adapt an English NE
system to the Cebuano language.

D. Maynard, V. Tablan, K. Bontcheva, H. Cunningham and Y. Wilks.
Rapid customisation of an Information Extraction system for surprise languages.
Special issue of ACM Transactions on Asian Language Information Processing:
Rapid Development of Language Capabilities: The Surprise Languages, 2003.

Of course there are lots of other ways, most of which will probably be more 
time-consuming but will get you better results.

Regards
Diana



Helge Thomas Karset Hellerud wrote:
> Hello,
> 
> PoS (Part of Speech) tagging is often used to extract phrases from text
> (like Noun Phrases). But that approach assumes you have a PoS tagger
> available. My document collection is in Norwegian, but I don't have a
> Norwegian tagger.
> 
> 1) Is there a way to create a simple PoS tagger to recognize verbs,
> nouns and adjectives (in Norwegian)?
> 
> 2) If not, does anyone have other approaches to extract phrases (like a
> statistical approach)?
> 
> Thanks in advance.
> 
> Helge
> 


