[Corpora-List] Q: Hyphenation removal
Roland Schäfer
roland.schaefer at fu-berlin.de
Thu Aug 16 11:37:53 UTC 2012
Dear list members,
are there any tools to remove hard-coded "hyphe- nation" from texts (or
papers describing principled solutions to the problem)? The
tool/solution should ideally:
* work even if the line break after the hyphen has been removed,
* differentiate with near-perfect accuracy between actual hyphenation
and other superficially identical graphematic constructions involving
hyphens (like German truncated compound coordination as in "Bus- und
Bahnticket" for "Busticket und Bahnticket"),
* (consequently:) be trainable (ideally in an unsupervised way) on
arbitrary languages which use some specifiable set of characters to
indicate hyphenation,
* possibly also detect cases of hyphenation which are written without a
space (as in "hyphe-nation") or with additional spaces (as in
"hyphe - nation"),
* process UTF-8 or at least some Unicode encoding,
* be open-source/patent-free and available in the form of a library or
command line tool (e.g., not a GUI tool/part of some OCR product/web
service).
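To make the task concrete, here is a minimal Python sketch of a naive
baseline, not a solution that meets the requirements above. It assumes
that a word split by hyphenation also occurs unsplit somewhere in a
reference text, and it joins a candidate only if the joined form is
attested there. The names build_vocab, dehyphenate, hyphens and
min_count are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Matches a run of Unicode letters (no digits, no underscore).
LETTERS = r"[^\W\d_]+"


def build_vocab(text):
    """Count lower-cased letter sequences found in a reference text."""
    return Counter(w.lower() for w in re.findall(LETTERS, text))


def dehyphenate(text, vocab, hyphens="-", min_count=1):
    """Join 'left- right' candidates whose joined form is attested.

    hyphens:   assumed set of characters that may mark hyphenation.
    min_count: assumed minimum frequency of the joined form in vocab.
    Candidates whose joined form is unattested are left untouched.
    """
    pattern = re.compile(
        "(" + LETTERS + r")\s*[" + re.escape(hyphens) + r"]\s*(" + LETTERS + ")"
    )

    def repl(match):
        joined = match.group(1) + match.group(2)
        if vocab[joined.lower()] >= min_count:
            return joined          # e.g. "hyphe- nation" -> "hyphenation"
        return match.group(0)      # e.g. "Bus- und" stays as it is

    return pattern.sub(repl, text)


reference = "Hyphenation is hard. A Busticket or a Bahnticket."
vocab = build_vocab(reference)
sample = ("remove hyphe- nation and hyphe-nation and hyphe - nation "
          "but keep Bus- und Bahnticket")
print(dehyphenate(sample, vocab))
```

Such a lookup handles the three spacing variants and leaves "Bus- und"
alone only because "Busund" is unattested. It will misfire whenever the
joined string happens to be a word, and it will miss every hyphenated
word that is absent from the reference text, so it falls well short of
the near-perfect accuracy asked for above.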
Of course, references which only partially match these requirements are
also highly appreciated.
Thanks a lot.
Regards
Roland