Dear Roland,<div><br></div><div dir="ltr">Perhaps the data-driven solution described in <a href="http://www.euppublishing.com/doi/abs/10.3366/cor.2011.0010">http://www.euppublishing.com/doi/abs/10.3366/cor.2011.0010</a> could help you.</div>
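<div dir="ltr"><br></div><div dir="ltr">To illustrate what a data-driven approach to the requirements quoted below might look like, here is a minimal sketch of a generic frequency heuristic. It is not necessarily the method of the paper above. The names build_vocab, dehyphenate and min_count are hypothetical, and the sketch assumes a reasonably clean reference corpus is available to count word forms. A hyphen candidate is joined only if the joined form is attested in that corpus and is more frequent there than the left fragment on its own.</div>

```python
import re
from collections import Counter

# Candidate: fragment, optional spaces, hyphen, optional spaces, fragment.
# In Python 3, \w is Unicode-aware for str, so non-ASCII letters are covered.
CANDIDATE = re.compile(r"(\w+)\s*-\s*(\w+)")
WORD = re.compile(r"\w+")


def build_vocab(corpus: str) -> Counter:
    """Count lowercased word forms in a reference corpus (assumed mostly clean)."""
    return Counter(w.lower() for w in WORD.findall(corpus))


def dehyphenate(text: str, vocab: Counter, min_count: int = 1) -> str:
    """Join hyphen candidates whose joined form is attested in the vocabulary.

    Hypothetical decision rule: join only if the joined form occurs at least
    min_count times and more often than the left fragment alone.
    """
    def decide(match: re.Match) -> str:
        left = match.group(1).lower()
        joined = (match.group(1) + match.group(2)).lower()
        if vocab[joined] >= min_count and vocab[joined] > vocab[left]:
            return match.group(1) + match.group(2)
        return match.group(0)  # leave the original span untouched

    return CANDIDATE.sub(decide, text)


vocab = build_vocab("hyphenation is common; a Busticket and a Bahnticket")
print(dehyphenate("hyphe- nation, hyphe-nation and hyphe - nation", vocab))
# prints: hyphenation, hyphenation and hyphenation
print(dehyphenate("Bus- und Bahnticket", vocab))
# prints: Bus- und Bahnticket
```

<div dir="ltr">Because "Busund" is not attested in the reference corpus, the truncated coordination is left alone. A real system would need a much larger corpus and a more careful decision rule, for example for legitimately hyphenated compounds whose joined spelling also exists.</div>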
<div dir="ltr"><br></div><div dir="ltr">Best,</div><div dir="ltr">--Jordi<br><br>On Thursday, 16 August 2012, Roland Schäfer wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Dear list members,<br>
<br>
are there any tools to remove hard-coded "hyphe- nation" from texts (or<br>
papers describing principled solutions to the problem)? The<br>
tool/solution should ideally:<br>
<br>
* work even if the line break after the hyphen has been removed,<br>
<br>
* differentiate with near-perfect accuracy between actual hyphenation<br>
and other superficially identical graphematic constructions involving<br>
hyphens (like German truncated compound coordination as in "Bus- und<br>
Bahnticket" for "Busticket und Bahnticket"),<br>
<br>
* (consequently:) be trainable (ideally in an unsupervised way) on<br>
arbitrary languages which use some specifiable set of characters to<br>
indicate hyphenation,<br>
<br>
* possibly also detect cases of hyphenation which are written without a<br>
space (as in "hyphe-nation") or with additional spaces (as in "hyphe - nation"),<br>
<br>
* process UTF-8 or at least some Unicode encoding,<br>
<br>
* be open-source/patent-free and available in the form of a library or<br>
command line tool (e.g., not a GUI tool/part of some OCR product/web<br>
service).<br>
<br>
Of course, references which only partially match these requirements are<br>
also highly appreciated.<br>
<br>
Thanks a lot.<br>
<br>
Regards<br>
Roland<br>
</blockquote></div>