[Corpora-List] Horizontal spaces and sentence segmentation

Sun Jan 31 21:30:34 UTC 2010

We use something like this technique for the sentence splitter in GATE 
(http://gate.ac.uk). We provide several slightly different sets of 
rules, and leave it to the user to select the ruleset most appropriate 
for their texts (usually the choice comes down to exactly the case 
that Marcin describes in his first sentence). GATE can also apply a 
different version of the sentence splitter to different texts, and 
make the selection automatically within a single run of the 
application on the corpus. So, for example, if a corpus contains 
different kinds of texts which require different kinds of sentence 
splitting, this can be achieved without manual intervention.
Diana
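
As an illustration of the per-text selection described above, here is a
minimal Python sketch. It is not GATE's API or its actual rulesets: the
two splitter functions, the SPLITTERS table and the "kind" labels
("wrapped_prose", "one_item_per_line") are all hypothetical, and it
assumes each text already carries a label saying which ruleset it needs.

```python
import re

def split_on_punctuation(text):
    """Hypothetical ruleset 1: break after . ! or ?; newlines count as spaces."""
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

def split_on_punctuation_and_newlines(text):
    """Hypothetical ruleset 2: as above, but every line break is also a boundary."""
    sentences = []
    for line in text.splitlines():
        sentences.extend(split_on_punctuation(line))
    return sentences

# Assumed mapping from a per-document label to the ruleset to apply.
SPLITTERS = {
    "wrapped_prose": split_on_punctuation,
    "one_item_per_line": split_on_punctuation_and_newlines,
}

def segment_corpus(corpus):
    """corpus: iterable of (kind, text) pairs; returns one sentence list per text."""
    return [SPLITTERS[kind](text) for kind, text in corpus]
```

With this arrangement a mixed corpus is processed in one pass, and each
text is split by whichever ruleset its label selects.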

Marcin Miłkowski wrote:
> Dear Adam,
> 
> I think in most cases it is assumed that a paragraph ends either with a 
> single newline or with more than one. It's hard to decide which 
> assumption is best without seeing the data.
> 
> However, you could use heuristics similar to those used for dots in 
> many languages. For example, if the next token after a newline is a 
> lowercase word, you probably haven't got a new paragraph. If it is 
> uppercase (and not a proper name), then you probably have. There 
> should be many regularities to discover. I suppose you can either try 
> to train a segmenter (if you have a manually segmented corpus at hand) 
> or try a rule-based approach and see how it works on your data.
> 
> Hope this helps,
> Marcin Miłkowski
> 
> On 2010-01-30 10:34, Adam Radziszewski wrote:
>> Dear corpora users,
>> I'm looking for hints or recipes on segmentation of plain text into
>> sentences. I know the topic is really down-to-earth and has been
>> discussed too many times, yet having skimmed a number of NLP/CL/IR
>> textbooks and dedicated articles I still can't find any hint on an
>> inescapable problem: when to split sentences if there is no explicit
>> 'dot' character (in European languages).
>>
>> All the resources I've seen concentrate on abbreviations and deciding
>> when a dot is indeed a full stop. The problem appears when dealing
>> with plain text (or after removal of mark-up which is usually
>> unreliable) with no explicit paragraph boundaries. For instance a
>> heading followed by a paragraph or a list of items ends up as a huge
>> sentence (as there was no dot character in the input). This is
>> especially painful when dealing with text extracted from PDF files or
>> ill-formatted Wikipedia distillate. Some textbooks mention the general
>> problem of splitting text into paragraphs, yet the methods employed
>> seem too sophisticated at this level (e.g. referring to lexical
>> semantics).
>>
>> Some implicit splitting on newlines is sometimes done (e.g. the NLTK
>> implementation of Punkt assumes that each newline is a sentence break).
>> I know the 'correct' behaviour is dependent on the expected formatting
>> of input, nevertheless I wonder if anyone has tried a systematic
>> approach to this issue. This would help me avoid reinventing the
>> wheel.
>>
>> Any suggestions, references or links to software will be appreciated.
>> And sorry for spamming if you consider this problem already solved or
>> too obvious (perhaps there can't be a systematic way and that's it).
>>
>> Best,
>> Adam Radziszewski

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora