[Corpora-List] Horizontal spaces and sentence segmentation

Adam Radziszewski kocikikut at gmail.com
Mon Feb 1 10:13:56 UTC 2010


I've received a number of insightful points and hints. Thanks for
that! I will outline some of what I've learnt -- perhaps someone will
find it useful when searching the archive.

1. Useful search terms are 'text normalization' and 'rich text transcription'.
There are some trainable methods as well:
- text segmentation may be cast as a sequence labelling problem:
http://www.aclweb.org/anthology/P/P07/P07-1087.pdf (also contains a good
literature survey),
- having observed a sequence of characters, we decode the underlying
sequence of tokens: Clark -- "Pre-Processing Very Noisy Text"
2. Some more grammars and software:
- http://www.cs.rochester.edu/u/tetreaul/academic.html (heading
Sentence Splitters)
- http://www.cis.uni-muenchen.de/~wastl/kurse/kut/eos.html,
http://www.cis.uni-muenchen.de/~wastl/misc/
- there is a Unicode recommendation for sentence delimiting rules:
http://www.unicode.org/reports/tr29/#Sentence_Boundaries
3. Two or more consecutive newlines are considered a safe paragraph
boundary (but it is worth checking that the following word is not
lower-case).
4. It's good to provide a couple of segmentation strategies and let
the user select the one that best suits their texts.
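
The heuristic in point 3 could be sketched roughly as follows in Python.
This is only an illustration under stated assumptions: the function name
split_paragraphs is hypothetical, blank lines containing only spaces or
tabs are treated as empty, and a chunk starting with a lower-case letter
is simply merged back into the previous paragraph.

```python
import re

# Assumption: a "blank-line run" is two or more newlines, where the lines
# in between may contain only spaces or tabs.
_BLANK_RUN = re.compile(r"\n[ \t]*\n(?:[ \t]*\n)*")


def split_paragraphs(text):
    """Split on blank-line runs, unless the next chunk starts lower-case."""
    chunks = [c for c in _BLANK_RUN.split(text) if c.strip()]
    paragraphs = []
    for chunk in chunks:
        first_char = chunk.lstrip()[:1]
        if paragraphs and first_char.islower():
            # A lower-case start suggests the sentence is still running,
            # so the candidate boundary is rejected.
            paragraphs[-1] = paragraphs[-1] + " " + chunk.lstrip()
        else:
            paragraphs.append(chunk)
    return paragraphs
```

A stricter or looser variant of the lower-case check could be offered as
one of the selectable strategies mentioned in point 4.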

> corpus. So for example if a corpus contains different kinds of texts which
> require different kinds of sentence splitting, this can be achieved without
> manual intervention.
I'm impressed. I will test the newest version (I remember GATE doing
well with segmentation into paragraphs although my input was quite
simple then). Actually, I was thinking of a first run to gather
some simple statistics (average line length, whitespace types etc.)
and a second run to do the actual segmentation -- but a single run
definitely seems better, as we don't need to buffer the whole
document.
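
For reference, the statistics-gathering first run described above might
look something like the Python sketch below. The function name
layout_stats and the choice of statistics are assumptions made for
illustration; it reads the whole text at once, which is exactly the
buffering cost a single-run approach avoids.

```python
from collections import Counter


def layout_stats(text):
    """Collect simple layout statistics from a document (hypothetical helper)."""
    lines = text.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    # Average length of non-empty lines, in characters.
    avg_len = sum(len(ln) for ln in non_empty) / len(non_empty) if non_empty else 0.0
    # How often each whitespace character (space, tab, newline, ...) occurs.
    whitespace = Counter(ch for ch in text if ch.isspace())
    return {
        "avg_line_length": avg_len,
        "blank_lines": len(lines) - len(non_empty),
        "whitespace": whitespace,
    }
```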

Adam
