Arabic-L:LING:Segmenter and Tagger for Arabic

Dilworth Parkinson Dilworth_Parkinson at byu.edu
Wed Jul 3 22:42:34 UTC 2002


----------------------------------------------------------------------
Arabic-L: Wed 03 Jul 2002
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message to listserv at byu.edu with first line reading:
          unsubscribe arabic-l                                      ]

-------------------------Directory-------------------------------------

1) Subject:Segmenter and Tagger for Arabic

-------------------------Messages--------------------------------------
1)
Date:  03 Jul 2002
From:Andrew Freeman <andyf at umich.edu>
Subject:Segmenter and Tagger for Arabic

Hi,
    The latest and greatest lexicon, transformation rules for the tagger
and segmenter, and about 19,000 words of annotated, segmented,
transliterated Arabic text can be obtained during the next month and a
half at the following URL:
http://www-personal.umich.edu/~andyf/segmenter/segment_tag.zip
All of the executables run inside of a DOS window on a Windows machine.

Please report all bugs, difficulties and problems you encounter and I
will try to deal with them.

After my signature block are the contents of the file segment_tag.readme.

have fun,
andy

1.0 Overview
2.0 Transliteration
3.0 Segmenter
  3.1 Options
  3.2 lexicon files
  3.3 segment bigrams file
4.0 Brill's tagger
  4.1 tagger lexicon
  4.2 adding words from the segmenter
  4.3 perl files for extracting unknown words
  4.4 context file
  4.5 training the tagger


1.0 Overview
    The whole point here is to use Brill's tagger, i.e. the
transformation-based learning (TBL) machine learning technique, to
facilitate the acquisition of disambiguation rules for doing part-of-speech
tagging for Arabic.  The point of doing POS tagging for Arabic is
        1) to build a parser,
        2) to annotate a corpus for doing empirical linguistic research
           on Arabic,
        3) to provide electronic linguistic resources for Arabic, and
        4) to do content-based text processing in Arabic.

    Brill's tagger needs to have a lexicon of lexemes with the
allowable tags for each lexeme, in the form shown below:
    squat JJ NN VB
    scrutiny NN
    James NNP
    LSO NNP
    missing VBG JJ NN


    As it turns out, the Arabic writing system strings together a lot
of independent items that in English orthography would be counted as
separate words.  The value added here is the word segmenter, which is a
sort of morphology recognizer.

    The data flow for the system is shown below.  The transliteration
is necessary for two reasons: 1) Brill's tagger is written for ASCII and
2) there are at least two Arabic character sets in common use.


    Arabic text file --> (transliterator) --> (segmenter) --> (tagger)
--> transliterated tagged Arabic text

    Each piece can be run independently of every other piece, so a
more complete picture of the data flow would be
    Arabic text file --> (transliterator) --> transliterated Arabic text
    transliterated Arabic text --> (segmenter) --> segmented and
        transliterated Arabic text
    segmented and transliterated Arabic text --> (tagger) --> transliterated,
        segmented and tagged Arabic text
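
    For example, using the command lines described in sections 2.0 through
4.0 below and a hypothetical input file sample.txt, one complete pass over a
single file would look roughly like this:
    arabic_chars -4 < sample.txt > sample.trnsl
    seg_arabic < sample.trnsl > sample.segm
    TAGGER.EXE ..\arab_lex.start sample.segm BIGBIGRAMLIST LEXICALRULEFILE.andy CNTEXT.andy > sample.tagged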

    Neither the segmenter nor the tagger is perfect, so each one of
these phases requires a couple of iterations with hand-correction of the
output and/or augmentation of the lexicons between successive runs before
the output will be correct.
/****************************************************************
 ****    Everything runs inside of a DOS window.
 ****************************************************************/

2.0 Transliteration
    The executable that performs the transliteration is
"arabic_chars.exe"
The following command line switches are supported.
  -1  : transliterate MS-CP1256 text into DATR transliteration scheme
      : parm == Query string for Arabic_u.dtr lexicon
      :         default DATR query == <voc>
  -2  : change chars in Freeman transliteration into MS-CP1256 text
  -3  : change chars in MS-CP1256 into Freeman transliteration scheme
      : parm == n if you don't want to strip the English characters
  -4  : change chars in MS-CP1256 into Buckwalter transliteration scheme
      : parm == n if you don't want to strip the English characters
  -5  : change chars in Buckwalter transliteration into MS-CP1256 text
  -6  : change chars in ISO-8859-6 into Buckwalter transliteration
  -7  : change chars in Buckwalter transliteration into ISO-8859-6 text

  Options -1 through -3 can be ignored and will be removed in a later
revision.  The batch file translit.bat contains the following command line
for transliterating CP-1256 into the Buckwalter transliteration scheme:
    arabic_chars -4 <%1.txt > %1.trnsl
The input file needs to have a ".txt" extension, and the output file
will be given a ".trnsl" extension.
The following command line will convert CP-1256 into the Mac version of
ISO-8859-6:
    arabic_chars -4 <%1.txt | arabic_chars -7 > %1.iso_txt
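
For example, if your Arabic text is in a file called news01.txt (a
hypothetical name), typing
    translit news01
runs the translit.bat command line above and leaves the transliterated
output in news01.trnsl.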

3.0 Segmenter
    The segmenter takes as its input Arabic text that is represented
in the Buckwalter transliteration scheme.
  3.1 Options
    There is an X option that will output all of the
segmentations/lexemes that are not in the current lexical files.  With no
options, the segmenter will just output the segmentation that it
decides is correct.  The "X" option allows you to simply cut the
lexemes that were not found in the lexicon(s) and, after any needed
hand-corrections (there might not be any), paste them into the segmenter
lexicon.

/*****************************************************************
 **** The segmenter lexicon file needs to be sorted in ASCII
 **** ascending order or nothing will work.
 *****************************************************************/
   You might want to hang on to this list of lexemes in order to add them
to the tagger's lexicon.
    There are two batch files that will perform these two options:
->segment_X.bat<- which contains the following command line
    seg_arabic X < %1.trnsl > %1.segm

and ->segment.bat<- which contains the following command line
    seg_arabic < %1.trnsl > %1.segm
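
For example, with a transliterated file news01.trnsl (again a hypothetical
name), typing
    segment_X news01
runs the segmenter with the X option and writes its output to news01.segm,
while
    segment news01
runs it without the option.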

  3.2 lexicon files
    The following files need to be present before the segmenter will
operate correctly:
    small_lexicon_1a.txt     \\ all of the stems that are not affixal
                             \\ morphology and are not prepositions;
                             \\ must be sorted in ascending order
    arb_rots.lex             \\ all valid roots in the language as per
                             \\ Buckwalter
    prep_arabs.translit      \\ all prepositions
    int inn_wuxt_count       \\ particles that assign accusative case
    spesh_dict.txt           \\ a context-free rules file for mapping a
                             \\ problematic orthographic word into its
                             \\ constituent lexemes
    corpus_sorted.bigram     \\ a list of all bigrams of all text correctly
                             \\ segmented so far, sorted by count in
                             \\ decreasing order

  3.3 segment bigrams file
    Once you have the input file correctly segmented, cut and paste it
into the file "corpus_all.segm".  Then you will need to create a new
"corpus_sorted.bigram".  The following two commands will perform this:
    text2wngram -n 2 < corpus_all.segm > corpus_all.bigram
    sort_bigram < corpus_all.bigram > corpus_sorted.bigram

    You may want to create a batch file for this.  There is nothing to
prevent running this all as one command line, i.e.
    text2wngram -n 2 < corpus_all.segm | sort_bigram > corpus_sorted.bigram

    I personally prefer being able to take a look at the intermediate
file.  I have been keeping all of these files in a different directory
from the one where I am doing all of my annotation work.  So I then need
to copy this file, "corpus_sorted.bigram" back into my working directory.
This directory is named "texts_Folder\combined\segments" and should be
created when you unzip the archive.

4.0 Brill's tagger
    There is a batch file called tag_it.bat that runs the following
command line:
    TAGGER.EXE ..\arab_lex.start %1.segm BIGBIGRAMLIST LEXICALRULEFILE.andy CNTEXT.andy > %1.tagged

    The following files need to be in the directory in which you run the
tagger:
    BIGBIGRAMLIST, LEXICALRULEFILE.andy, CNTEXT.andy, tagger.exe,
start_state_tag.exe, and final_state_tag.exe.
    The file "arab_lex.start" needs to be in the parent directory.  If you
find this annoying, move "arab_lex.start" into the working directory and
change the batch file accordingly.
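
    As with the other batch files, tag_it takes the base name of the file,
so (for a hypothetical news01.segm) typing
    tag_it news01
writes the tagged output to news01.tagged.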

    Before running the tagger you will probably want to add the new words
that were discovered during the segmentation phase of the process.
  4.1 tagger lexicon
    The "file arab_lex.start" contains the tagger lexicon with entries
of the form
Eryf NNM JJMS
EryqAt NP
Erys NNM
Eskryp JJFS
mwqE NNM JJMS

  4.2 adding words from the segmenter
    Take the lexemes saved from the segmentation phase that you added
to the segmentation lexicon.  You will need to decide the correct tag(s)
for those lexemes.  If more than one tag is possible, place the most
likely tag first in the list of possible tags that follows the lexeme.
For instance, in the example above the lexeme "mwqE" can be either a noun
or an adjective, but I have decided arbitrarily that the noun tag (NNM) is
more probable.  Now run the batch file tag_it.

  4.3 perl files for extracting unknown words
    There are still likely to be some unknown words.  Unknown words
that cannot be correctly tagged from the context are tagged with either
an "NN" or an "NNP".  The NNP is for words beginning with a capital letter.
One annoying thing that will be fixed in a later revision is that some of
the letters used in the transliteration scheme overlap with Perl's special
regexp characters.  I need to escape them.  Currently the string "$dyd/NN"
will get extracted as "dyd/NN" because the dollar sign is an anchor in the
regexp stuff.
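
    A minimal sketch of this kind of extraction (not the actual script
behind xtract.bat; the name extract_nn.pl is made up), written so that
Buckwalter characters such as "$" survive by matching non-whitespace rather
than \w, might look like this:

    #!/usr/bin/perl
    # extract_nn.pl (hypothetical): print every word tagged NN in the
    # tagged text read from STDIN, keeping characters like "$" intact.
    while (my $line = <STDIN>) {
        for my $token (split /\s+/, $line) {
            # \S+ rather than \w+ so "$dyd/NN" yields "$dyd", not "dyd"
            if ($token =~ m{^(\S+)/NN$}) {
                print "$1\n";
            }
        }
    }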

    There are two batch files, "xtract.bat" and "xtractNNP.bat", that
each invoke a Perl script for extracting the words tagged with these
"unknown word tags".  If you do not have Perl you can do a search for them
in your text editor.  Take these unknown words, edit them with the correct
tag, and add them to the tagger lexicon "arab_lex.start".  If you have
added the new words from the segmentation phase there should not be very
many of these.  Finally, hand check and correct the tagged file.  At this
phase I have typically been seeing 7 or 8 errors in a 600-segment run of
text, or just over one percent errors.  If you can think of a context rule
that would fix an error, add it to the file "CNTEXT.andy".  The tagger
lexicon file "arab_lex.start" does not need to be sorted.  The contextual
rule file needs to be in rule execution order, i.e. rules later in the file
get applied after rules located earlier in the file.
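
    For reference, contextual rules in Brill's tagger are plain-text lines
of the form old-tag new-tag trigger argument(s).  As an illustrative,
made-up example (not taken from CNTEXT.andy), a rule such as
    NNM JJMS PREVTAG NNM
would retag a word from NNM to JJMS whenever the immediately preceding word
is tagged NNM.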
  4.4 context file
    The context file currently being used, CNTEXT.andy, has been
composed entirely by hand.  In the training directory there is a context
file called CONTEXT-RULEFILE that was created by the training phase
of Brill's tagger.  On my to-do list is comparing the performance of the
hand-crafted rules file with the performance of the automatically
generated rules.  My hunch is that the hand-crafted rules are still more
accurate than the software-generated rules.
  4.5 training the tagger
    Read the file "readme.training".  There are batch files in the
text_folder\combined\training directory that will help with typing the
long command lines and stuff.  You will need a perl script interpreter.

    If you maintain the directory structure from the compressed
archive file, i.e. texts_Folder\combined with the subdirectories
        segments, taggeds, training, translits

    where there is a file called corpus_all.segm in the segments
directory and a file called corpus_all.tagged in the taggeds directory,
AND these files contain everything tagged and segmented and corrected so
far, THEN in the directory "training" you should be able to run the batch
files by typing the following commands in the order shown.
wordlist
divide_2
smallword_lst
bigram
lexrule
train_lexic
final_lexic
untag2
make_dummy_corp
learn_cntxt

   Finally, if you want to use these newly generated rules files,
CONTEXT-RULEFILE and LEXRULEOUTFILE, you need to copy them into the
directory where you are doing your segmenting and tagging, and either
rename them or edit the batch file tag_it.bat to make the names match.
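
    If you go the edit-the-batch-file route, the tag_it.bat command line
would presumably become
    TAGGER.EXE ..\arab_lex.start %1.segm BIGBIGRAMLIST LEXRULEOUTFILE CONTEXT-RULEFILE > %1.tagged
with the two new file names substituted for LEXICALRULEFILE.andy and
CNTEXT.andy.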


Good luck and have fun
andy June 28, 2002

The Manouba paper and the ECL paper describe the algorithm of the
segmenter in some detail.  Also, the most complete list of the tagset is
given in the file arabic_tags.txt.

The current to-do list:
1) integrate the tagger and the segmenter lexicons into a single
"database" and mark the difference between the list of verbs and nouns for
the segmenter, eliminating trying verb morphology on forms like tfEyl.
2) integrate all executable portions into a single executable with a GUI
interface.
3) add Unicode support.
4) do the tagging and segmenting in the same pass using info from a tag
trigram model to help with the segmenting and the segmentation morphology
to aid in POS tagging.
5) currently I am segmenting, tagging and correcting roughly 600 segments
in about 2 hours.  This means that I could realistically tag about 9,000
words a week and still have time to do other things.  There is every
indication that the more text I tag, the quicker it goes.  I could
realistically have 500,000 words or more tagged by this time next year.
With any help I am sure I could double that.
6) Start working on a probabilistic chart parser or implement the Collins
parser and start building a tree-bank, using the POS corpus that I am
accumulating.
7) Compare the performance of the hand crafted rules file with the
performance of the automatically generated rules.



--------------------------------------------------------------------------
End of Arabic-L:  03 Jul 2002


