[Corpora-List] sentence boundary detectors
Nino Simunic
nino.simunic at uni-due.de
Wed Feb 28 11:09:47 UTC 2007
Dear Armin,
Take a look at the Punkt system. It is an unsupervised, multilingual
sentence boundary detector that was tested on eleven languages and
achieved pretty good scores:
Tibor Kiss, Jan Strunk. 2006. Unsupervised Multilingual Sentence Boundary
Detection. In: Computational Linguistics 32 (4). Cambridge: MIT Press.
485-525.
PDF:
http://www.linguistics.ruhr-uni-bochum.de/~kiss/publications/compling2005_KS27.01final.pdf
Their current implementation is written in Perl, as far as I know.
Bye,
Nino
http://www.uni-due.de/computerlinguistik/simunic.shtml
>>-----Original Message-----
>>From: owner-corpora at lists.uib.no
>>[mailto:owner-corpora at lists.uib.no] On Behalf Of Armin Schmidt
>>Sent: Tuesday, February 20, 2007 6:21 PM
>>To: Joel Tetreault
>>Cc: corpora at uib.no
>>Subject: Re: [Corpora-List] sentence boundary detectors
>>
>>
>>Joel,
>>
>>thanks. Unfortunately, many of the links on your page are
>>indeed dead. But I'll post a summary of all the responses I
>>got so far to the list, so you can update your link list, too.
>>
>>Of course, I searched the archives (and the web) before
>>posting to the corpora list, but the responses to those earlier
>>posts were of only limited use for my particular task. Also,
>>I wanted to find out whether, in the meantime, sentence splitters
>>had been developed that could be trained on particular
>>corpora in a language-independent manner (more on this in my
>>summary).
>>
>>Cheers,
>>Armin
>>
>>Joel Tetreault schrieb:
>>>
>>> Hi Armin, if you scroll way down to the "Tools" section of my website,
>>> and then scroll down to the "Sentence Splitters" subsection, you
>>> should find links to several splitters.
>>>
>>> http://www.cs.rochester.edu/u/tetreaul/academic.html
>>>
>>> (Please excuse the fact that I threw all these links up on one page :) )
>>>
>>> Your question was posed to corpora-list 3 or 4 years ago, so all the
>>> links above (including an updated link to Scott Piao's Java one) are
>>> from other researchers emailing in with their suggestions. I just ran
>>> through the links, and since it has been several years, a bunch are
>>> dead. But if you google the names of the splitters or their authors,
>>> you can probably find their new locations.
>>>
>>> I'd also check out the corpora-list archives:
>>>
>>> http://listserv.linguistlist.org/cgi-bin/wa?S1=corpora
>>>
>>> there might be some emails/links that I missed...
>>>
>>> Joel
>>>
>>>
>>> On Mon, 19 Feb 2007, Scott Songlin Piao wrote:
>>>
>>>> Hi Armin,
>>>>
>>>> I put my English sentence splitter on the website:
>>>> http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
>>>>
>>>> It is a rule-based Java program and is downloadable.
>>>>
>>>> Cheers
>>>>
>>>> Scott Piao
>>>> ----------------------------
>>>> Text Mining
>>>> School of Computer Science
>>>> The University of Manchester
>>>> UK
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no]
>>>> On Behalf Of Armin Schmidt
>>>> Sent: 17 February 2007 19:48
>>>> To: corpora at uib.no
>>>> Subject: [Corpora-List] sentence boundary detectors
>>>>
>>>> Dear list,
>>>>
>>>> I was wondering if you could point me to good sentence splitters for
>>>> the following languages: German, Russian, Spanish, English. It would
>>>> be great if they were stand-alone programs or modules for Python
>>>> (Perl would be ok, too ... although I'm already aware of the
>>>> respective CPAN-modules for English and German).
>>>>
>>>> Since I do have corpora in all the above mentioned languages I would
>>>> also be very interested in available implementations (not papers) of
>>>> any unsupervised learning methods for detecting sentence boundaries
>>>> (or rather abbreviations).
>>>>
>>>> Thanks,
>>>> Armin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>--
>>http://diotavelli.net/people/armin/
>>
>>