[Corpora-List] Phrase extraction

Diana Maynard d.maynard at dcs.shef.ac.uk
Wed Oct 26 08:29:37 UTC 2005


Apologies to those who noticed the broken link - I accidentally reset the 
permissions - it should be fixed now!
I should emphasise that the solutions proposed in this paper were very ad hoc 
- more a sneaky way of getting results fast rather than a "nice" solution! But 
useful as a means to an end.
Diana

Anna Feldman wrote:
> Dear Diana,
> 
> I'm very interested in the kind of work you are doing, but 
> unfortunately, the link to your publications page is broken. Could you 
> please check?
> 
> Thanks,
> 
> Anna Feldman
> 
> 
> 
> On Tue, 25 Oct 2005, Diana Maynard wrote:
> 
>> Hi Helge
>> I am sure there are some Norwegian taggers out there somewhere, but I 
>> don't know if any of them are free.
>>
>> If you don't have a suitable training corpus, and don't want to create 
>> one manually, then depending on how ambiguous the language in question 
>> is with respect to POS, and how accurate you need your results, you 
>> might be able to 
>> generate a rough and ready POS tagger using just a monolingual (or 
>> bilingual) online Norwegian dictionary and a tagger such as the Brill 
>> tagger. I've done this for various languages by simply replacing the 
>> tagger's lexicon with a lexicon of the target language (using a few 
>> scripts to reformat it appropriately to match the Brill one) and using 
>> the default ruleset for the closest language to your target (in terms 
>> of family and behaviour). Then just run the tagger as usual on your 
>> corpus. You won't get perfect results but you might get something good 
>> enough for your purposes, depending on what you ultimately want to do.
>> I've generated a Hindi tagger with around 70% accuracy in this way 
>> (using GATE and the Hepple tagger, which is like the Brill tagger) 
>> with nothing more than a basic Hindi-English bilingual dictionary. 
>> I've done the same for Western languages and got much better results.
>>
>> See http://www.dcs.shef.ac.uk/~diana/publications.html
>> for a paper which discusses using this technique to adapt an English 
>> NE system to the Cebuano language.
>>
>> D. Maynard and V. Tablan and K. Bontcheva and H. Cunningham and Y. Wilks.
>> Rapid customisation of an Information Extraction system for surprise 
>> languages.
>> Special issue of ACM Transactions on Asian Language Information
>> Processing: Rapid Development of Language Capabilities: The Surprise 
>> Languages,
>> 2003.
>>
>> Of course there are lots of other ways, most of which will probably be 
>> more time-consuming but will get you better results.
>>
>> Regards
>> Diana
>>
>>
>>
>> Helge Thomas Karset Hellerud wrote:
>>
>>> Hello,
>>>
>>> PoS (Part of Speech) tagging is often used to extract phrases from text
>>> (like Noun Phrases). But that approach assumes you have a PoS tagger
>>> available. My document collection is in Norwegian, but I don't have a
>>> Norwegian tagger.
>>>
>>> 1) Is there a way to create a simple PoS tagger to recognize verbs,
>>> nouns and adjectives (in Norwegian)?
>>>
>>> 2) If not, does anyone have other approaches to extract phrases (like a
>>> statistical approach?)
>>>
>>> Thanks in advance.
>>>
>>> Helge
>>>
>>
>>
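
The lexicon-replacement step described in the quoted message above (reformatting a dictionary so that it matches the tagger's lexicon) might look roughly like the sketch below. It is only an illustration under several assumptions, not the scripts actually used: the input is assumed to be tab-separated "word<TAB>label" lines, the labels "subst", "verb" and "adj" and their mapping to Penn Treebank-style tags are hypothetical, and the output assumes a Brill-style lexicon with one word per line followed by its possible tags. Check the format your tagger really expects before using anything like this.

```python
# Hypothetical mapping from dictionary part-of-speech labels to
# Penn Treebank-style tags; adjust to your dictionary and ruleset.
POS_TO_TAG = {"subst": "NN", "verb": "VB", "adj": "JJ"}


def parse_dictionary(lines):
    """Yield (word, label) pairs from tab-separated dictionary lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            yield parts[0], parts[1]


def build_lexicon(entries):
    """Turn (word, label) pairs into 'word TAG1 TAG2 ...' lines.

    Tags are kept in the order they first appear for each word.
    Entries with an unmapped label, an empty word, or a multi-word
    headword are skipped, since they would not fit a one-word-per-line
    lexicon.
    """
    tags_by_word = {}
    for word, label in entries:
        word = word.strip()
        tag = POS_TO_TAG.get(label.strip().lower())
        if not word or " " in word or tag is None:
            continue
        tags = tags_by_word.setdefault(word, [])
        if tag not in tags:
            tags.append(tag)
    return [f"{word} {' '.join(tags)}" for word, tags in sorted(tags_by_word.items())]


# Tiny made-up sample standing in for a real dictionary export.
sample = [
    "hus\tsubst\n",
    "lese\tverb\n",
    "stor\tadj\n",
    "rett\tsubst\n",
    "rett\tadj\n",
    "og\tkonj\n",  # label not in the mapping, so this entry is skipped
]

lexicon_lines = build_lexicon(parse_dictionary(sample))
for line in lexicon_lines:
    print(line)
```

A dictionary lists possible tags but says nothing about which one is most frequent, so the tag order here is arbitrary. If the tagger treats the first tag as the most likely one, ambiguous words such as "rett" are where a rough tagger of this kind is likely to go wrong.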
