[Corpora-List] Phrase extraction
Diana Maynard
d.maynard at dcs.shef.ac.uk
Tue Oct 25 10:03:59 UTC 2005
Hi Helge
I am sure there are some Norwegian taggers out there somewhere, but I don't
know if any of them are free.
If you don't have a suitable training corpus and don't want to create one
manually, then, depending on how ambiguous the language in question is with
respect to POS and how accurate you need your results to be, you might be able
to generate a rough-and-ready POS tagger using just a monolingual (or
bilingual) online Norwegian dictionary and a tagger such as the Brill tagger.
I've done this for various
languages by simply replacing the tagger's lexicon with a lexicon of the
target language (using a few scripts to reformat it appropriately to match the
Brill one) and using the default ruleset for the closest language to your
target (in terms of family and behaviour). Then just run the tagger as usual
on your corpus. You won't get perfect results, but you might get something good
enough for your purposes, depending on what you ultimately want to do.
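
The reformatting step could look roughly like the sketch below. It is only an
illustration under several assumptions: that your dictionary can be dumped as
"headword<TAB>pos" lines, that the tagger's lexicon wants one word per line
followed by its possible tags (most likely tag first), and that the borrowed
ruleset uses Penn Treebank tags. The label mapping and the function names are
made up for the example, so check them against your own dictionary and tagger.

```python
# Hypothetical mapping from dictionary POS labels to Penn Treebank tags.
# Both the labels on the left and the choice of tagset are assumptions.
POS_TO_PENN = {"n": "NN", "v": "VB", "adj": "JJ", "adv": "RB", "prep": "IN"}


def build_lexicon(dictionary_lines):
    """Collect {word: [tags]} from 'headword<TAB>pos' lines.

    Tags are kept in first-seen order, so the first dictionary sense
    ends up as the word's default tag.
    """
    lexicon = {}
    for line in dictionary_lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed entries
        word, pos = line.split("\t", 1)
        tag = POS_TO_PENN.get(pos.strip().lower())
        if tag is None:
            continue  # label we have no mapping for
        tags = lexicon.setdefault(word.strip(), [])
        if tag not in tags:
            tags.append(tag)
    return lexicon


def format_lexicon(lexicon):
    """Render the lexicon as 'word TAG1 TAG2 ...' lines, sorted by word."""
    return "\n".join(
        f"{word} {' '.join(tags)}" for word, tags in sorted(lexicon.items())
    )


if __name__ == "__main__":
    sample = ["hus\tn", "spise\tv", "stor\tadj", "løp\tn", "løp\tv"]
    print(format_lexicon(build_lexicon(sample)))
```

Write the result to a file and point the tagger at it in place of its own
lexicon. Words missing from the dictionary will fall through to whatever
unknown-word handling the borrowed ruleset provides.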
I've generated a Hindi tagger with around 70% accuracy in this way (using GATE
and the Hepple tagger, which is like the Brill tagger) with nothing more than
a basic Hindi-English bilingual dictionary. I've done the same for Western
languages and got much better results.
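
If you want a figure like that for your own tagger, one simple way is to
hand-tag a small sample and count how often the tagger agrees with you. A
minimal sketch, assuming both outputs are available as parallel lists of
(word, tag) pairs over the same tokens; the function name and the toy data
are just for illustration.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the hand-assigned tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(
        1 for (_, gold_tag), (_, pred_tag) in zip(gold, predicted)
        if gold_tag == pred_tag
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Toy example: three tokens, one tagging error.
    gold = [("huset", "NN"), ("er", "VB"), ("stort", "JJ")]
    predicted = [("huset", "NN"), ("er", "VB"), ("stort", "NN")]
    print(f"{tagging_accuracy(gold, predicted):.2f}")
```

A few hundred hand-tagged tokens is usually enough to tell whether the output
is usable for your purposes.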
See http://www.dcs.shef.ac.uk/~diana/publications.html
for a paper which discusses using this technique to adapt an English NE
system to the Cebuano language.
D. Maynard and V. Tablan and K. Bontcheva and H. Cunningham and Y. Wilks.
Rapid customisation of an Information Extraction system for surprise languages.
Special issue of ACM Transactions on Asian Language Information
Processing: Rapid Development of Language Capabilities: The Surprise Languages,
2003.
Of course there are lots of other ways, most of which will probably be more
time-consuming but will get you better results.
Regards
Diana
Helge Thomas Karset Hellerud wrote:
> Hello,
>
> PoS (Part of Speech) tagging is often used to extract phrases from text
> (like Noun Phrases). But that approach assumes you have a PoS tagger
> available. My document collection is in Norwegian, but I don't have a
> Norwegian tagger.
>
> 1) Is there a way to create a simple PoS tagger to recognize verbs,
> nouns and adjectives (in Norwegian)?
>
> 2) If not, does anyone have other approaches to extract phrases (like a
> statistical approach?)
>
> Thanks in advance.
>
> Helge
>