[Corpora-List] Fwd: subject : detection of the origin of language
maxwell
maxwell at umiacs.umd.edu
Thu Apr 25 13:58:37 UTC 2013
saadane houda <saadane_houda at yahoo.fr> wrote:
> I want to ask you if you know of software or programs
> (open source) for the detection of the origin of
> language(Arabic, French or English)
The usual sorts of language ID programs work quite well if there's
enough text, and the system has been trained on that kind of text.
Mostly they use character n-grams. There are at least two situations
where they can go wrong:
1) They haven't been trained on the particular type of text, e.g.
they've been trained on Arabic text in Arabic script, but not on
Arabizi (Arabic written in Latin letters and digits).
2) You're trying to detect code switching in text, e.g. the occasional
use of English or French loanwords or other terms inside Arabic text,
where the loanwords are written in the same script as the Arabic.
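To make the character n-gram idea concrete, here is a minimal sketch in Python. It is not the method of any particular language ID tool: it simply compares character trigram counts by cosine similarity. The training sentences in SAMPLES and the function names are made up for illustration, and a real system would be trained on far more text of the kind it will actually see.

```python
import math
from collections import Counter

# Tiny made-up training samples; real profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "a sentence written in the english language",
    "fr": "le renard brun saute par dessus le chien paresseux et ceci "
          "est une phrase écrite dans la langue française",
    "ar": "الثعلب البني السريع يقفز فوق الكلب الكسول وهذه جملة "
          "مكتوبة في اللغة العربية",
}

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with word boundaries as spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    grams = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))

profiles = train(SAMPLES)
print(identify("le chien est dans la maison", profiles))
```

A profile built only from Arabic-script text shares no trigrams with Arabizi input, which is situation (1) in miniature. A single loanword inside a sentence contributes too few trigrams to shift the overall score, which is situation (2).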
Problem (2) is made worse by the fact that many such English-in-Arabic
words won't be in dictionaries (even assuming you try to map the Arabic
script in a fuzzy way to roman script), because they're place names or
person names.
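A rough sketch of what such a fuzzy mapping plus dictionary lookup might look like follows. Everything in it is an assumption made for illustration: the partial letter table ARABIC_TO_ROMAN, the consonant-skeleton matching, and the toy LEXICON. It is not a real transliteration standard.

```python
# Partial, hypothetical Arabic-to-roman letter table; letters not listed
# are silently dropped.
ARABIC_TO_ROMAN = {
    "ا": "a", "ب": "b", "ت": "t", "د": "d", "ر": "r", "س": "s",
    "ف": "f", "ك": "k", "ل": "l", "م": "m", "ن": "n", "و": "w", "ي": "y",
}

# Merge roman letters that Arabic script tends not to distinguish.
MERGE = str.maketrans({"p": "b", "c": "k", "q": "k", "v": "f"})
# Vowels and semivowels are ignored, since Arabic spelling marks them loosely.
DROP = set("aeiouyw")

# Toy English lexicon of common words (no place or person names).
LEXICON = ["computer", "internet", "telephone", "bank"]

def romanize(word):
    """Map an Arabic-script word to a crude roman string."""
    return "".join(ARABIC_TO_ROMAN.get(ch, "") for ch in word)

def skeleton(roman):
    """Reduce a roman string to a consonant skeleton for fuzzy matching."""
    merged = roman.lower().translate(MERGE)
    return "".join(ch for ch in merged if ch.isalpha() and ch not in DROP)

def lookup(arabic_word, lexicon):
    """Return lexicon entries whose skeleton matches the Arabic word's."""
    index = {}
    for entry in lexicon:
        index.setdefault(skeleton(entry), []).append(entry)
    return index.get(skeleton(romanize(arabic_word)), [])

print(lookup("كمبيوتر", LEXICON))  # 'computer' written in Arabic script
print(lookup("لندن", LEXICON))     # 'London': a name, absent from the lexicon
```

The second lookup comes back empty even though the word is English in origin, because names are not in the lexicon. With a realistically large lexicon, skeleton matching this loose would also produce many false matches against genuine Arabic words.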
There is also a problem of deciding whether a word *is* Arabic, English
or French; I've heard the Arabic word 'mufti' used in English in ways
that I'm guessing it wouldn't be used in Arabic. Does it count as an
Arabic word, or as English? And if you think that one is clear (perhaps
because its meaning in English is so divergent from its meaning in
Arabic), then there are other, more borderline, examples. (The same
problem arises with place names; is 'Cairo' an Arabic word in English
text, just because it refers to a place in the Arabic-speaking world?)
Of course, whether this sort of thing is a problem depends on what use
you want to put the results to. Maybe place names don't matter for your
purposes.
There are doubtless papers, even books, written on these issues. (And
there was a discussion on this list a while back about it.) Of course,
if your task is to decide whether paragraph-sized stretches of text are
(mostly) English, French or Arabic, then the usual language ID programs
will work just fine.
Mike Maxwell
University of Maryland