[Corpora-List] Fwd: subject : detection of the origin of language

Thu Apr 25 13:58:37 UTC 2013

saadane houda <saadane_houda at yahoo.fr> wrote:
> I want to ask you if you know of software or programs
> (open source) for the detection of the origin of
> language(Arabic, French or English)

The usual sorts of language ID programs work quite well if there's 
enough text, and the system has been trained on that kind of text.  
Mostly they use character n-grams.  There are at least two situations 
where they can go wrong:

1) They haven't been trained on the particular type of text, e.g. 
they've been trained on Unicode Arabic text, but not on Arabizi.

2) You're trying to detect code switching in text, e.g. the occasional 
use of English or French loanwords or other terms inside Arabic text, 
where the loanwords are written in the same script as the Arabic.

Problem (2) is made worse by the fact that many such English-in-Arabic 
words won't be in dictionaries (even assuming you try to map the Arabic 
script in a fuzzy way to roman script), because they're place names or 
person names.
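
To make the character n-gram idea concrete, here is a minimal Python 
sketch (not any particular tool's implementation).  It builds a trigram 
frequency profile per language and picks the profile with the highest 
cosine similarity to the input.  The training sentences, the function 
names and the choice of cosine similarity are all assumptions made for 
illustration; real systems train on far more text and typically use 
smoothed probabilistic models.

```python
from collections import Counter
import math

# Toy training data (an assumption for illustration; real systems need far more).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the cat sat on the mat with the other animals",
    "french": "le renard brun rapide saute par dessus le chien paresseux "
              "et puis le chat est assis sur le tapis avec les autres animaux",
    "arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول "
              "ثم جلست القطة على السجادة مع الحيوانات الأخرى",
}

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with whitespace normalized."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def identify(text, profiles, n=3):
    """Return (best_language, scores) for the given text."""
    query = Counter(char_ngrams(text, n))
    scores = {lang: cosine(query, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores

profiles = train(SAMPLES)
print(identify("the dog and the cat", profiles)[0])
print(identify("le chat et le chien", profiles)[0])
print(identify("الكلب على السجادة", profiles)[0])

# Situation (1): Arabizi shares no trigrams with the Arabic-script profile,
# so its similarity to "arabic" is zero no matter what the text says.
print(identify("kifak ya sahbi", profiles)[1]["arabic"])
```

The last call illustrates situation (1): a model trained only on Arabic 
script has nothing to match against romanized Arabic, so whatever label 
comes back for such input is not meaningful.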

There is also a problem of deciding whether a word *is* Arabic, English 
or French; I've heard the Arabic word 'mufti' used in English in ways 
that I'm guessing it wouldn't be used in Arabic.  Does it count as an 
Arabic word, or as English?  And if you think that one is clear (perhaps 
because its meaning in English is so divergent from its meaning in 
Arabic), then there are other, more borderline, examples.  (The same 
problem arises with place names; is 'Cairo' an Arabic word in English 
text, just because it refers to a place in the Arabic-speaking world?)  
Of course, whether this sort of thing is a problem depends on what use 
you want to put the results to.  Maybe place names don't matter for your 
purposes.

There are doubtless papers, even books, written on these issues.  (And 
there was a discussion on this list a while back about it.)  Of course, 
if your task is to decide whether paragraph-sized stretches of text are 
(mostly) English, French or Arabic, then the usual language ID programs 
will work just fine.

    Mike Maxwell
    University of Maryland

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora