Arabic-L:LING:segmenter, tagger lexicon, spooler and a spoken corpus for Yemeni Arabic

Mon Mar 4 23:59:44 UTC 2002

----------------------------------------------------------------------
Arabic-L: Mon 04 Mar 2002
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message to listserv at byu.edu with first line
reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory-------------------------------------

1) Subject:segmenter, tagger lexicon, spooler and a spoken corpus for
Yemeni Arabic

-------------------------Messages--------------------------------------
1)
Date:  04 Mar 2002
From:Andrew Freeman <andyf at umich.edu>
Subject:segmenter, tagger lexicon, spooler and a spoken corpus for
Yemeni Arabic

Howdy y'all,

    If any of the above is of interest you might want to look at the URLs:
http://www-personal.umich.edu/~andyf/segmenter/si-760/presentationfolder/hand_tag/
http://www-personal.umich.edu/~andyf/segmenter/si-760/presentationfolder/trained_tagg/
Code for retraining the tagger once you have grown the corpus is at
http://www-personal.umich.edu/~andyf/segmenter/si-760/brills_tagger/utilities/
A Windows build for English of Brill's tagger can be found at
http://www-personal.umich.edu/~andyf/segmenter/si-760/brills_tagger/bin_and_data/
A small corpus of spoken Yemeni Arabic can be found at:
http://www-personal.umich.edu/~andyf/segmenter/corpora/

Brill's tagger is public domain.  The segmenter lets me build a
reasonably sized tagger lexicon for Brill's tagger.  It is still very much
a rough hack, and I would not normally let it out, but at ALS-XVI enough
people convinced me that they could start using it in its current very
rough form.
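
For anyone wondering what "building a tagger lexicon" amounts to, here is a
rough sketch in Python. It is not the code at the URLs above. It assumes the
tagged text is written as space-separated word/TAG tokens and that the
lexicon wants one word per line followed by its tags, most frequent first;
check both assumptions against the documentation shipped with Brill's
tagger. The function name build_lexicon and the sample tags are made up for
illustration.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_lines):
    """Collect, for every word, the tags seen with it, most frequent first.

    Assumes tokens look like word/TAG (hypothetical input layout).
    Returns a list of lines of the form "word TAG1 TAG2 ...".
    """
    tag_counts = defaultdict(Counter)
    for line in tagged_lines:
        for token in line.split():
            # Split on the last slash so words containing "/" survive.
            word, sep, tag = token.rpartition("/")
            if not sep or not word or not tag:
                continue  # skip tokens that are not word/TAG pairs
            tag_counts[word][tag] += 1

    lexicon = []
    for word in sorted(tag_counts):
        # Order tags by descending count, then alphabetically for ties.
        ranked = sorted(tag_counts[word].items(), key=lambda kv: (-kv[1], kv[0]))
        lexicon.append(" ".join([word] + [tag for tag, _ in ranked]))
    return lexicon


if __name__ == "__main__":
    sample = ["al/DET bayt/NOUN", "al/DET kataba/VERB bayt/NOUN bayt/VERB"]
    for entry in build_lexicon(sample):
        print(entry)
```

Growing the hand-corrected corpus and rerunning something like this is the
cheap way to grow the lexicon before retraining.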

The comments do not all match the current state of the code.  Also, the
segmenter currently returns the first match even when more than one
segmentation is possible.  The next edit, which will be done before the
end of April, will include some statistical smarts trained on the current
segmented corpus, which will be about 16,000 segments once I am done
correcting the current files, ch1_sparrow.segm and ch1_sparrow_tagged.
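
To make the first-match problem concrete, here is a minimal sketch in
Python of one simple way to choose among several candidate segmentations.
It is not the segmenter's actual code, and the statistics planned for the
next edit may look quite different. The affix lists, the transliterated
examples and all function names are hypothetical; the scoring is a plain
add-one-smoothed unigram model over segments counted from an already
segmented corpus.

```python
from collections import Counter
from math import log


def all_segmentations(word, prefixes, suffixes):
    """Return every (prefix, stem, suffix) split the affix lists allow."""
    results = []
    for p in [""] + sorted(prefixes):
        if not word.startswith(p):
            continue
        rest = word[len(p):]
        for s in [""] + sorted(suffixes):
            if s and not rest.endswith(s):
                continue
            stem = rest[:len(rest) - len(s)] if s else rest
            if stem:  # never allow an empty stem
                results.append((p, stem, s))
    return results


def train_counts(segmented_corpus):
    """Count how often each non-empty segment occurs in the corpus."""
    counts = Counter()
    for segments in segmented_corpus:
        counts.update(seg for seg in segments if seg)
    return counts


def best_segmentation(word, prefixes, suffixes, counts):
    """Pick the candidate whose segments are most probable overall."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen segments

    def score(segments):
        # Sum of add-one-smoothed log probabilities of the segments.
        return sum(log((counts[s] + 1) / (total + vocab)) for s in segments if s)

    return max(all_segmentations(word, prefixes, suffixes), key=score)


if __name__ == "__main__":
    corpus = [("", "bayt", ""), ("", "bayt", ""), ("b", "bayt", ""), ("", "kitAb", "")]
    counts = train_counts(corpus)
    print(all_segmentations("bayt", {"b"}, set()))
    print(best_segmentation("bayt", {"b"}, set(), counts))
```

A segmenter that strips a prefix as soon as one matches would split "bayt"
into "b" + "ayt"; scoring the candidates against counts from the corrected
corpus keeps the whole stem instead.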

This stuff comes as is, and I won't have a minute to even create a readme
file until sometime in April.

If anybody wants to share any text that they might end up annotating with
these tools, I won't complain.

cheers,
andy

--------------------------------------------------------------------------
End of Arabic-L:  04 Mar 2002
