Arabic-L:LING:LDC new Morphological Analyzer (SAMA)

Dilworth Parkinson dil at BYU.EDU
Mon Jul 26 12:29:26 UTC 2010


------------------------------------------------------------------------
Arabic-L: Mon 26 Jul 2010
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: LDC new Morphological Analyzer (SAMA)

-------------------------Messages-----------------------------------
1)
Date: 26 Jul 2010
From: Linguistic Data Consortium <ldc at ldc.upenn.edu>
Subject: LDC new Morphological Analyzer (SAMA)

The LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 was developed by researchers at LDC. SAMA 3.1 is based on, and updates, Tim Buckwalter's Buckwalter Arabic Morphological Analyzer (BAMA) 2.0 (LDC2004L02). Although this is the first public release of SAMA, its version number continues the BAMA sequence to reflect the continuity between this release and previous BAMA releases.

SAMA 3.1 is a software tool for the morphological analysis of Standard Arabic. It considers each Arabic word token in all possible 'prefix-stem-suffix' segmentations and lists all known/possible annotation solutions, with assignment of all diacritic marks, morpheme boundaries (separating clitics and inflectional morphemes from stems), and all Part-of-Speech (POS) labels and glosses for each morpheme segment. Users may then review the generated output and select the most appropriate annotation from among the choices offered.

The software layer of SAMA 3.1 relies on a data layer that consists primarily of three Arabic-English lexicon files: prefixes (1328 entries), suffixes (945 entries), and stems (79318 entries representing 40654 lemmas). The lexicons are supplemented by three morphological compatibility tables used for controlling prefix-stem combinations (2497 entries), stem-suffix combinations (1632 entries), and prefix-suffix combinations (1180 entries).
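The interplay of the three lexicons and the three compatibility tables can be illustrated with a minimal sketch. It is written in Python rather than SAMA's Perl and is not the SAMA implementation: the lexicon entries, category names, length limits, and function names below are all hypothetical, chosen only to show how a candidate 'prefix-stem-suffix' split might be accepted or rejected.

```python
from itertools import product

# Toy lexicons keyed by surface string (Buckwalter-style transliteration).
# Each entry is (morphological category, gloss). All entries and category
# names are hypothetical and far smaller than the real SAMA files.
PREFIXES = {
    "": [("Pref-0", "")],
    "w": [("Pref-Wa", "and")],
    "Al": [("Pref-Al", "the")],
    "wAl": [("Pref-WaAl", "and the")],
}
STEMS = {
    "ktAb": [("N", "book")],
    "ktb": [("PV", "write")],
}
SUFFIXES = {
    "": [("Suff-0", "")],
    "h": [("Suff-Poss", "his/its")],
}

# Toy compatibility tables: sets of category pairs allowed to co-occur.
PREFIX_STEM = {
    ("Pref-0", "N"), ("Pref-0", "PV"), ("Pref-Wa", "N"),
    ("Pref-Wa", "PV"), ("Pref-Al", "N"), ("Pref-WaAl", "N"),
}
STEM_SUFFIX = {("N", "Suff-0"), ("PV", "Suff-0"), ("N", "Suff-Poss")}
PREFIX_SUFFIX = {
    ("Pref-0", "Suff-0"), ("Pref-0", "Suff-Poss"), ("Pref-Wa", "Suff-0"),
    ("Pref-Wa", "Suff-Poss"), ("Pref-Al", "Suff-0"), ("Pref-WaAl", "Suff-0"),
}


def segmentations(word, max_prefix=4, max_suffix=6):
    """Yield every (prefix, stem, suffix) split with a non-empty stem.

    The length limits are assumptions made for this sketch.
    """
    n = len(word)
    for p in range(min(max_prefix, n - 1) + 1):
        for s in range(min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def analyze(word):
    """Return all splits whose parts are in the lexicons and mutually compatible."""
    solutions = []
    for pre, stem, suf in segmentations(word):
        if pre not in PREFIXES or stem not in STEMS or suf not in SUFFIXES:
            continue
        for (pcat, pgl), (scat, sgl), (xcat, xgl) in product(
            PREFIXES[pre], STEMS[stem], SUFFIXES[suf]
        ):
            # A solution must pass all three pairwise compatibility checks.
            if ((pcat, scat) in PREFIX_STEM
                    and (scat, xcat) in STEM_SUFFIX
                    and (pcat, xcat) in PREFIX_SUFFIX):
                solutions.append({
                    "segments": (pre, stem, suf),
                    "categories": (pcat, scat, xcat),
                    "glosses": (pgl, sgl, xgl),
                })
    return solutions


for token in ("wAlktAb", "wktAbh", "AlktAbh"):
    print(token, [s["segments"] for s in analyze(token)])
```

In this toy data the split Al + ktAb + h passes the prefix-stem and stem-suffix checks but is rejected by the prefix-suffix table, which is the kind of filtering the third table makes possible.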

The input format, output format, and data layer of SAMA 3.1 were designed to be backward compatible with BAMA. Incremental changes to the data layer in SAMA have resulted in:

  - increased lexicon coverage in the dictionary files
  - important changes and additions to the inventory of POS tags
  - more possible solutions generated for numerous word forms

The software implementation has been updated to allow more input/output options, more installation and configuration options, and smoother incorporation into other Perl tools/services. The structure of the dictionary and morphotactic tables has remained the same (the tables provided with SAMA 3.1 differ from the BAMA 2.0 tables only in size and content, not in format). Logical separation between the software layer and the data layer allows the new software tools to be used with previous versions of the tables (instructions are provided with the software documentation). The basic logic that implements the segmentation and analysis look-up for Arabic words is essentially unchanged since BAMA 2.0.

The data layer is now accessed through Berkeley DB, with result-caching enabled by default, leading to improved performance. Various utility scripts have also been added to the software package to facilitate more flexible interaction with tools and data.
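As a rough illustration of why a cache in front of an on-disk key-value store helps, here is a minimal Python sketch. It is an assumption-laden stand-in, not SAMA's code: Python's standard `dbm.dumb` module takes the place of Berkeley DB, the `CachedStore` class and its methods are hypothetical, and the announcement does not say what exactly SAMA caches.

```python
import dbm.dumb
import os
import tempfile


class CachedStore:
    """On-disk key-value store with an in-memory cache of lookup results.

    Hypothetical class; dbm.dumb is used here only as a stand-in for
    Berkeley DB, which is not part of the Python standard library.
    """

    def __init__(self, path):
        self._db = dbm.dumb.open(path, "c")  # create the store if missing
        self._cache = {}
        self.disk_reads = 0  # counts lookups that had to go to disk

    def put(self, key, value):
        self._db[key.encode("utf-8")] = value.encode("utf-8")
        self._cache.pop(key, None)  # drop any stale cached result

    def get(self, key):
        if key in self._cache:
            return self._cache[key]  # served from memory, no disk access
        self.disk_reads += 1
        raw = self._db.get(key.encode("utf-8"))
        value = raw.decode("utf-8") if raw is not None else None
        self._cache[key] = value  # misses are cached as None as well
        return value

    def close(self):
        self._db.close()


with tempfile.TemporaryDirectory() as tmp:
    store = CachedStore(os.path.join(tmp, "stems"))
    store.put("ktAb", "N\tbook")
    first = store.get("ktAb")   # read from disk, then cached
    second = store.get("ktAb")  # answered from the cache
    reads_after_two_lookups = store.disk_reads
    store.close()
```

Frequent tokens recur constantly in running text, so repeated lookups of the same key are the case a cache like this is meant to speed up.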

As a Members-Only release, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 is not available for non-member licensing.

--------------------------------------------------------------------------
End of Arabic-L: 26 Jul 2010


