[Corpora-List] Q: Hyphenation removal
John D Burger
john at mitre.org
Mon Aug 20 14:35:37 UTC 2012
Jordi Porta Zamorano wrote:
> Dear Roland,
>
> Perhaps the data-driven solution described in http://www.euppublishing.com/doi/abs/10.3366/cor.2011.0010 could help you.
Anyone have a pointer to a copy of this paper that doesn't cost money?
- John Burger
MITRE
> On Thursday, 16 August 2012, Roland Schäfer wrote:
> Dear list members,
>
> Are there any tools to remove hard-coded "hyphe- nation" from texts (or
> papers describing principled solutions to the problem)? The
> tool/solution should ideally:
>
> * work even if the line break after the hyphen has been removed,
>
> * differentiate with near-perfect accuracy between actual hyphenation
> and other superficially identical graphematic constructions involving
> hyphens (like German truncated compound coordination as in "Bus- und
> Bahnticket" for "Busticket und Bahnticket"),
>
> * (consequently:) be trainable (ideally in an unsupervised way) on
> arbitrary languages which use some specifiable set of characters to
> indicate hyphenation,
>
> * possibly also detect cases of hyphenation which are written without a
> space (as in "hyphe-nation") or with additional spaces (as in "hyphe - nation"),
>
> * process UTF-8 or at least some Unicode encoding,
>
> * be open-source/patent-free and available in the form of a library or
> command line tool (e.g., not a GUI tool/part of some OCR product/web
> service).
>
> Of course, references which only partially match these requirements are
> also highly appreciated.
>
> Thanks a lot.
>
> Regards
> Roland
>
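To make the requirements in the quoted question concrete, here is a minimal lexicon-based sketch in Python. It is not the method of the paper linked above, and it is not an existing tool: the names build_vocab, dehyphenate and min_count are made up for illustration. The assumption is that a hyphenated fragment pair is real hyphenation only if the joined form is attested elsewhere in some reference text, which is a crude stand-in for a trained model.

```python
import re
from collections import Counter

# Matches "left-right", "left- right" and "left - right"
# (also across a line break, since \s covers newlines).
# \w is Unicode-aware for str patterns in Python 3.
HYPHEN = re.compile(r"(\w+)\s*-\s*(\w+)")


def build_vocab(text):
    """Count lowercased word forms in a reference text (hypothetical helper)."""
    return Counter(w.lower() for w in re.findall(r"\w+", text))


def dehyphenate(text, vocab, min_count=1):
    """Join 'left- right' only if the joined form is attested in vocab."""

    def repl(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if vocab[joined.lower()] >= min_count:
            return joined
        # Unattested join: leave the original span untouched,
        # e.g. "Bus- und" (there is no word "Busund").
        return match.group(0)

    return HYPHEN.sub(repl, text)


if __name__ == "__main__":
    vocab = build_vocab("Hyphenation is common in typeset text.")
    print(dehyphenate('hard-coded "hyphe- nation" and Bus- und Bahnticket', vocab))
    # prints: hard-coded "hyphenation" and Bus- und Bahnticket
```

This covers the cases with no space, one space, or spaces on both sides of the hyphen, and it leaves "Bus- und Bahnticket" alone because "Busund" is unattested. It will not reach near-perfect accuracy: a genuine hyphenated compound is wrongly joined whenever its closed spelling also occurs in the reference text (e.g. "e-mail" next to "email"), and real hyphenation is missed when the joined word never occurs there.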
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora