[Corpora-List] sentence detector and phrase chunker returning absolute positions in text

Adam Funk a.funk at dcs.shef.ac.uk
Mon Jul 19 08:58:33 UTC 2010


[19/07/10 09:18] Adam Radziszewski wrote:
> Dear Wiebke,
> 
>> I have checked OpenNLP, Gate, LingPipe and MontyLingua but did not find
> 
> I doubt if Gate has a ready-made option triggerable from the GUI to
> output these positions only. However, its XML format is based on such
> character positions:
> http://gate.ac.uk/sale/tao/splitch5.html#sec:corpora:schemas
> and, what is more, if you write your own plugin or just use the Java
> API, you can easily iterate over Annotation objects and fetch their
> starting and ending positions.

It's easier than that in GATE: just load ANNIE; delete the gazetteer,
POS tagger, and NE transducer; and add a JAPE transducer created from
the attached file.  If you run this in the GATE GUI, the output will
appear in the "Messages" pane (you can copy and paste from there).

Note that the numbers are character offsets in the plain-text content of
the file (after removing HTML or XML tags and converting them to
annotations) and that the ANNIE sentence splitter does not include the
spaces and newlines between sentences in the sentences themselves, so
there will be gaps in the numbers.  Here's some sample output to
illustrate that.

0 -> 35
37 -> 46
48 -> 240
242 -> 670
672 -> 918
920 -> 1117
1118 -> 1224
1225 -> 1379
1381 -> 1638
1640 -> 1713
1715 -> 1743
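
To make the gaps concrete, here is a minimal Python sketch of how such
offset pairs relate to the plain text.  It is not GATE code and not the
attached JAPE file: the function name sentence_spans, the sample text,
and the naive regex splitter are all made up for illustration, and it
assumes each end offset is exclusive, so text[start:end] gives back the
sentence.

```python
import re


def sentence_spans(text):
    """Return (start, end) character offsets for each sentence in text.

    Hypothetical, naive splitter (NOT the ANNIE sentence splitter): a
    sentence starts at a non-whitespace character and ends after a run
    of '.', '!' or '?' that is followed by whitespace or the end of the
    text.  Whitespace between sentences is left out of every span.
    """
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", flags=re.DOTALL)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Made-up sample text: two spaces after the first sentence and a
    # newline after the second, so the offsets show gaps of 2 and 1.
    sample = "First sentence here.  Second one!\nThird."
    for start, end in sentence_spans(sample):
        # Same "start -> end" layout as the output quoted above.
        print(f"{start} -> {end}")
        # Assumption: end is exclusive, so slicing recovers the sentence.
        print(repr(sample[start:end]))
```

The characters skipped between one span's end and the next span's start
are exactly the spaces and newlines between sentences, which is why
consecutive offsets do not line up.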

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sentences.jape
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20100719/60862e19/attachment-0001.ksh>