[Corpora-List] sentence boundary detectors

Nino Simunic nino.simunic at uni-due.de
Wed Feb 28 11:09:47 UTC 2007


Dear Armin,

Take a look at the "Punkt" system. It is an unsupervised, multilingual
sentence boundary detector that was tested on eleven languages and
achieved pretty good scores:

Tibor Kiss, Jan Strunk. 2006. Unsupervised Multilingual Sentence Boundary
Detection. In: Computational Linguistics 32 (4). Cambridge: MIT Press.
485-525.
PDF:
http://www.linguistics.ruhr-uni-bochum.de/~kiss/publications/compling2005_KS27.01final.pdf

Their current implementation is written in Perl, as far as I know. 
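
To give a rough idea of what "unsupervised" means here, below is a toy
sketch in Python. It is NOT the Kiss & Strunk algorithm (Punkt uses
statistical tests, not this simple ratio); it only shows the general idea
of learning abbreviations from raw, unannotated text and then using them
when splitting. The function names and thresholds are my own assumptions,
chosen purely for illustration.

```python
import re
from collections import Counter


def learn_abbreviations(text, min_count=2, max_len=4, min_ratio=0.9):
    """Guess abbreviations from raw text (toy heuristic, not Punkt).

    A word type is treated as an abbreviation if it is short, occurs at
    least `min_count` times, and is almost always followed by a period.
    """
    with_period = Counter()
    total = Counter()
    for tok in text.split():
        # Alphabetic word, optional period, optional trailing punctuation.
        m = re.fullmatch(r"([A-Za-z]+)(\.?)[,;:]?", tok)
        if not m:
            continue
        word = m.group(1).lower()
        total[word] += 1
        if m.group(2):
            with_period[word] += 1
    return {
        w for w, n in total.items()
        if n >= min_count and len(w) <= max_len and with_period[w] / n >= min_ratio
    }


def split_sentences(text, abbreviations):
    """Split on ., ! or ? unless the token is a learned abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok[-1] in ".!?" and tok.rstrip(".!?").lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


corpus = "Dr. Smith met Prof. Jones. Dr. Jones thanked Prof. Smith. They left."
abbrevs = learn_abbreviations(corpus)
sentences = split_sentences(corpus, abbrevs)
```

A heuristic this crude will misclassify short words that happen to occur
only at sentence ends, which is exactly the kind of problem the statistical
tests in the paper are meant to handle.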

Bye,
Nino

http://www.uni-due.de/computerlinguistik/simunic.shtml


>>-----Original Message-----
>>From: owner-corpora at lists.uib.no 
>>[mailto:owner-corpora at lists.uib.no] On Behalf Of Armin Schmidt
>>Sent: Tuesday, February 20, 2007 6:21 PM
>>To: Joel Tetreault
>>Cc: corpora at uib.no
>>Subject: Re: [Corpora-List] sentence boundary detectors
>>
>>
>>Joel,
>>
>>thanks. Unfortunately, many of the links on your page are 
>>indeed dead. But I'll post a summary of all the responses I 
>>got so far to the list, so you can update your link list, too.
>>
>>Of course, I searched the archives (and the web) before 
>>posting to corpora list but the responses to those earlier 
>>posts were of only limited use for my particular task. Also, 
>>I wanted to find out if, in the meantime, sentence splitters 
>>had been developed which could be trained on particular 
>>corpora in a language-independent manner (more on this in my 
>>summary).
>>
>>Cheers,
>>Armin
>>
>>Joel Tetreault wrote:
>>> 
>>> hi Armin, if you scroll way down to the "Tools" section of my website,
>>> and then scroll down to the "Sentence Splitters" subsection, you
>>> should find links to several splitters.
>>> 
>>> http://www.cs.rochester.edu/u/tetreaul/academic.html
>>>
>>> (Please excuse the fact I threw all these links up on one page :) )
>>> 
>>> Your question was posed to corpora-list 3 or 4 years ago, so all the
>>> links above (including an updated link to Scott Piao's Java one) are
>>> from other researchers emailing in with their suggestions.  I just ran
>>> through the links, and since it has been several years, a bunch are
>>> dead.  But if you google the names of the splitters or their authors,
>>> you can probably find their new locations.
>>> 
>>> I'd also check out the corpora-list archives:
>>> 
>>> http://listserv.linguistlist.org/cgi-bin/wa?S1=corpora
>>> 
>>> there might be some emails/links that I missed...
>>> 
>>> Joel
>>> 
>>> 
>>> On Mon, 19 Feb 2007, Scott Songlin Piao wrote:
>>> 
>>>> Hi Armin,
>>>>
>>>> I put my English sentence splitter on the website: 
>>>> http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
>>>>
>>>> It is a rule-based Java program and is downloadable.
>>>>
>>>> Cheers
>>>>
>>>> Scott Piao
>>>> ----------------------------
>>>> Text Mining
>>>> School of Computer Science
>>>> The University of Manchester
>>>> UK
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no]
>>>> On Behalf Of Armin Schmidt
>>>> Sent: 17 February 2007 19:48
>>>> To: corpora at uib.no
>>>> Subject: [Corpora-List] sentence boundary detectors
>>>>
>>>> Dear list,
>>>>
>>>> I was wondering if you could point me to good sentence splitters for
>>>> the following languages: German, Russian, Spanish, English. It would
>>>> be great if they were stand-alone programs or modules for Python 
>>>> (Perl would be ok, too ... although I'm already aware of the 
>>>> respective CPAN-modules for English and German).
>>>>
>>>> Since I do have corpora in all the above-mentioned languages, I would
>>>> also be very interested in available implementations (not papers) of
>>>> any unsupervised learning methods for detecting sentence boundaries
>>>> (or rather abbreviations).
>>>>
>>>> Thanks,
>>>> Armin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>> 
>>
>>-- 
>>http://diotavelli.net/people/armin/
>>
>>


