[Corpora-List] Tools for manual control of corpus annotation

Emiliano Guevara emiliano.guevara at unibo.it
Wed Nov 21 10:13:30 UTC 2007


Dear all,

first of all, let me thank all of you for your very helpful
replies.
I will post a commented summary of the tips and software when the  
thread calms down...

Gerhard Kremer poses a very relevant question which I should have
clarified at the beginning: "what or how exactly should your task get  
easier to manage?". I apologise for this.

Below I try to spell out what I think a tool for manual annotation  
control should do (obviously, just a tentative sketch). Since I am  
using TreeTagger (and then the IMS Open Corpus Workbench, CWB), my  
ideas may not apply to everyone out there but, in any case, a large  
part of the community using these freely available tools will surely  
find the discussion useful.

Well, the premises are:
- a large text corpus in column format (output by TreeTagger, it  
could also be converted to XML)
- each line contains the columns "word...pos...lemma"
- the tagging is mostly right, but you need to identify the special
cases for which the tagger doesn't have enough statistics, so
automatic search and replace will not really help
- the resulting controlled corpus will be used for re-training the  
tagger
- the ideal tool would run on a UNIX-like system
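
To make the premises concrete, here is a minimal sketch in Python of how such a column file could be read and a single tag corrected. It assumes the three columns are separated by tabs; the names Token, parse_line, format_token and retag are hypothetical, and the sample tags are only illustrative.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One corpus line: word, POS tag, lemma (assumed tab-separated)."""
    word: str
    pos: str
    lemma: str


def parse_line(line: str) -> Token:
    """Split one 'word<TAB>pos<TAB>lemma' line into a Token."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated columns: {line!r}")
    return Token(*fields)


def format_token(token: Token) -> str:
    """Turn a Token back into a column-format line (without newline)."""
    return "\t".join(token)


def retag(tokens: List[Token], index: int, new_pos: str) -> List[Token]:
    """Return a copy of the token list with one POS tag replaced."""
    updated = list(tokens)
    updated[index] = updated[index]._replace(pos=new_pos)
    return updated


# Illustrative sample, not taken from a real corpus.
sample = "The\tDT\tthe\ncan\tMD\tcan\nrusts\tVVZ\trust\n"
tokens = [parse_line(line) for line in sample.splitlines()]
corrected = retag(tokens, 1, "NN")  # fix the tag of "can"
```

Writing the corrected tokens back with format_token keeps the file in the same column format, so it could still be used for re-training.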

The procedure, as I have done it so far for short portions of the  
text, is as simple as opening the text file in an editor, reading  
each line (paying attention to where the sentence/chunk starts) and  
manually selecting the wrong pos and replacing it with another one  
(copied from the tagset, in another file).
The tediousness of this procedure is clear: lots of attention needed  
to single out the problems, lots of clicking and scrolling for large  
files, lots of copying-pasting from the tagset, etc. All of this  
makes it very tiring and, above all, unsuitable as a task to give out  
to students.

The features that a manual annotation tool should have (ideally!) to  
make all of this easier:

- visualisation in pages (like `more' or `less' in UNIX), hopefully  
related to sentences or chunks as indicated by the tokenization/
tagging (this is quite easy if one uses XML and XPath)
- the field that we're interested in (the POS tag) should be
immediately highlighted and selected for modification
- if a POS is OK, or upon modification of a wrong POS, a simple key- 
binding should change the focus to the next POS
- if the whole visualised page is ok, another key-binding should  
change the focus to the next page
- insertion of a different POS should (ideally) be helped by auto- 
completion, or at least by checking consistency with the tagset  
provided in an external file
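
Two of the features above, paging by sentence and checking a new tag against the tagset, can be sketched in a few lines of Python. This is only an illustration under assumptions: tokens are (word, pos, lemma) tuples, a sentence is taken to end at a token tagged SENT (this depends on the tagset in use), and the function names and the small tagset are hypothetical. The interactive part (highlighting, key-bindings) is not covered.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Token = Tuple[str, str, str]  # (word, pos, lemma)


def pages(tokens: Sequence[Token],
          sentence_end_tag: str = "SENT") -> Iterator[List[Token]]:
    """Yield one page per sentence; the end-of-sentence tag is an assumption."""
    page: List[Token] = []
    for token in tokens:
        page.append(token)
        if token[1] == sentence_end_tag:
            yield page
            page = []
    if page:  # trailing tokens without a closing sentence tag
        yield page


def complete_tag(prefix: str, tagset: Iterable[str]) -> List[str]:
    """Return all tags starting with the typed prefix, sorted."""
    return sorted(tag for tag in tagset if tag.startswith(prefix))


def is_valid_tag(candidate: str, tagset: Iterable[str]) -> bool:
    """Consistency check: accept only tags listed in the tagset."""
    return candidate in set(tagset)


# Illustrative tagset; in practice it would be read from the external file.
tagset = {"DT", "NN", "NNS", "MD", "VVZ", "SENT"}
tokens = [
    ("The", "DT", "the"), ("can", "MD", "can"), ("rusts", "VVZ", "rust"),
    (".", "SENT", "."), ("Cans", "NNS", "can"), ("rust", "NN", "rust"),
]

for page in pages(tokens):
    print(" ".join(f"{word}/{pos}" for word, pos, _ in page))
```

A front end would show one page at a time, offer complete_tag for the characters typed so far, and refuse any replacement for which is_valid_tag returns False.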

I know all of this is a lot to wish for! And I also know that it  
could be accomplished by programming a dedicated GUI... but anyway, I  
wanted to give it a try at CORPORA!

The most promising responses so far are the following:
1. creating a set of macros and key-bindings for Emacs (or using  
Emacs in XML mode for an XML corpus)
2. POSEDIT, a program made at the U. of Perugia and released under a  
Creative Commons License (http://elearning.unistrapg.it/corpora/ 
posedit.html). Unfortunately it is only for Window$... but I will  
soon give it a try.

thanks again and cheers,

E.




On 20 Nov 2007, at 17:34, Gerhard Kremer wrote:

> Hello Emiliano,
>
> Perhaps only a little hint,
> but with not very much effort you can
> configure/program the emacs editor to
> make your work faster and easier
> (i would look for how to manually define a
> keyboard macro and bind it to a key combination).
>
> By the way, what or how exactly should your
> task get easier to manage?
> Would probably be useful for the other readers...
>
> Regards,
> Gerhard Kremer


****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia

Homepage: http://morbo.lingue.unibo.it/

E-mail:   emiliano.guevara at unibo.it
           emiguevara at gmail.com
****************************************


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


