[Corpora-List] Tools for manual control of corpus annotation
Emiliano Guevara
emiliano.guevara at unibo.it
Wed Nov 21 10:13:30 UTC 2007
Dear all,
first of all, let me thank all of you for your very helpful
replies.
I will post a commented summary of the tips and software when the
thread calms down...
Gerhard Kremer poses a very relevant question which I should have
clarified at the beginning: "what or how exactly should your task get
easier to manage?". I apologise for this.
Below I try to spell out what I think a tool for manual annotation
control should do (obviously, just a tentative sketch). Since I am
using TreeTagger (and then the IMS Open Corpus Workbench, CWB), my
ideas may not apply to everyone out there but, in any case, a large
part of the community using these freely available tools will surely
find the discussion useful.
Well, the premises are:
- a large text corpus in column format (output by TreeTagger, it
could also be converted to XML)
- each line contains the columns "word...pos...lemma"
- the tagging is mostly right, but you need to identify the special
cases for which the tagger doesn't have enough statistics, so
automatic search and replace will not really help
- the resulting controlled corpus will be used for re-training the
tagger
- the ideal tool would run on a UNIX-like system
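To make the premises concrete, here is a minimal Python sketch of
reading such a file. It assumes the three columns are tab-separated
and in the order word, pos, lemma; the names Token and read_tokens are
only illustrative, not part of TreeTagger or CWB.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per well-formed 'word<TAB>pos<TAB>lemma' line."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            # Assumption: blank lines and markup lines (no tabs) carry no
            # token, so they are skipped here rather than guessed at.
            continue
        yield Token(*parts)


# Invented sample data in the assumed column layout.
sample = [
    "The\tDT\tthe\n",
    "<s>\n",
    "cat\tNN\tcat\n",
    "sleeps\tVBZ\tsleep\n",
    ".\tSENT\t.\n",
]
tokens = list(read_tokens(sample))
```

A real tool would have to keep the skipped lines as well, so that the
corrected file can be written back unchanged apart from the POS column.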
The procedure, as I have done it so far for short portions of the
text, is as simple as opening the text file in an editor, reading
each line (paying attention to where the sentence/chunk starts) and
manually selecting the wrong POS and replacing it with another one
(copied from the tagset, in another file).
The tediousness of this procedure is clear: lots of attention needed
to single out the problems, lots of clicking and scrolling for large
files, lots of copying-pasting from the tagset, etc. All of this
makes it very tiring and, above all, unsuitable as a task to give out
to students.
The features that a manual annotation tool should have (ideally!) to
make all of this easier:
- visualisation in pages (like `more' or `less' in UNIX), hopefully
related to sentences or chunks as indicated by the tokenization/
tagging (this is quite easy if one uses XML and XPath)
- the field that we're interested in (the POS tag) should be
immediately highlighted and selected for modification
- if a POS is OK, or upon modification of a wrong POS, a simple key-
binding should change the focus to the next POS
- if the whole visualised page is ok, another key-binding should
change the focus to the next page
- insertion of a different POS should (ideally) be helped by auto-
completion, or at least by checking consistency with the tagset
provided in an external file
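As a rough illustration of three of these points (paging by sentence,
auto-completion and the tagset check), here is a small Python sketch.
It is not an existing tool: the function names are made up, the rows
are (word, pos, lemma) tuples, and the sentence-final tag is assumed to
be SENT, which depends on the tagset in use. The key-bindings and
highlighting are left out, since they belong to the user interface.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str, str]  # (word, pos, lemma)


def pages(rows: Iterable[Row], end_tag: str = "SENT") -> Iterator[List[Row]]:
    """Group rows into one page per sentence, closing a page at end_tag."""
    page: List[Row] = []
    for row in rows:
        page.append(row)
        if row[1] == end_tag:
            yield page
            page = []
    if page:
        yield page  # trailing tokens without a sentence-final tag


def complete(prefix: str, tagset: Iterable[str]) -> List[str]:
    """Return the tags from the tagset that start with prefix, sorted."""
    return sorted(tag for tag in tagset if tag.startswith(prefix))


def invalid_rows(rows: Iterable[Row], tagset: Iterable[str]) -> List[Row]:
    """Return the rows whose POS tag does not occur in the tagset."""
    allowed = set(tagset)
    return [row for row in rows if row[1] not in allowed]


# Invented sample data; "VPB" is a deliberate typo for "VBP".
tagset = ["DT", "NN", "NNS", "VBP", "VBZ", "SENT"]
rows = [
    ("The", "DT", "the"),
    ("cat", "NN", "cat"),
    ("sleeps", "VBZ", "sleep"),
    (".", "SENT", "."),
    ("Dogs", "NNS", "dog"),
    ("bark", "VPB", "bark"),
]

sentence_pages = list(pages(rows))
nn_candidates = complete("NN", tagset)
bad = invalid_rows(rows, tagset)
```

In practice the tagset would be read from the external file, one tag
per line, and the check would run whenever an annotator types a new tag.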
I know all of this is a lot to wish for! And I also know that it
could be accomplished by programming a dedicated GUI... but anyway, I
wanted to give it a try at CORPORA!
The most promising responses so far are the following:
1. creating a set of macros and key-bindings for Emacs (or using
Emacs in XML mode for an XML corpus)
2. POSEDIT, a program made at the U. of Perugia and released under a
Creative Commons License
(http://elearning.unistrapg.it/corpora/posedit.html).
Unfortunately it is only for Window$... but I will soon give it a try.
thanks again and cheers,
E.
On 20 Nov 2007, at 17:34, Gerhard Kremer wrote:
> Hello Emiliano,
>
> Perhaps only a little hint,
> but with not very much effort you can
> configure/program the emacs editor to
> make your work faster and easier
> (i would look for how to manually define a
> keyboard macro and bind it to a key combination).
>
> By the way, what or how exactly should your
> task get easier to manage?
> Would probably be useful for the other readers...
>
> Regards,
> Gerhard Kremer
****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia
Homepage: http://morbo.lingue.unibo.it/
E-mail: emiliano.guevara at unibo.it
emiguevara at gmail.com
****************************************
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora