[Corpora-List] Fwd: subject : detection of the origin of language

Thu Apr 25 20:58:51 UTC 2013

Hi, Mike,

Your second situation:

<<2) You're trying to detect code switching in text, e.g. the occasional
use of English or French loanwords or other terms inside Arabic text, where
the loanwords are written in the same script as the Arabic.>>

is NOT code-switching (almost never, anyway). My standard proof is the name
'Bach' as pronounced by minimally educated speakers of English: [bax].
There is absolutely *no* requirement for such speakers to be bilingual in any
degree in German, and, if they are Americans, they mostly are not.
Nevertheless, they use the non-English segment [x] if they are minimally
musically literate. It is important to recognize that the segment [x] used
by English speakers is *not* the same as that segment in the German
pronunciation of the name, which is more postvelar. That is, they are *not*
speaking German, even most of those who are bilingual in German. The word
is an *English* word (borrowed from German, but not, in most cases,
pronounced exactly as in German as it would be if the speaker were engaged
in code-switching). See below my signature for a pertinent limerick on this
point.

I am *not* claiming that there can be no one-word code-switching. Indeed, I
have witnessed in southern Texas a conversation by locals in a store where
there was code-switching between Spanish and English after every single
word! But there is a difference between borrowings and the use of
code-switching.

Jim

There once was a fellow named Hatch
Who was fond of the music of Bach.
He said, "It's not fussy,
Like Brahms or Debussy;
Sit down and I'll play you a snatch."

Now *that's* a borrowing! (But so is [bax].)

On Thu, Apr 25, 2013 at 8:58 AM, maxwell <maxwell at umiacs.umd.edu> wrote:

> saadane houda <saadane_houda at yahoo.fr> wrote:
>
>> I want to ask you if you know of software or programs
>> (open source) for the detection of the origin of
>> language (Arabic, French or English)
>>
>
> The usual sorts of language ID programs work quite well if there's enough
> text, and the system has been trained on that kind of text.  Mostly they
> use character n-grams.  There are at least two situations where they can go
> wrong:
>
> 1) They haven't been trained on the particular type of text, e.g. they've
> been trained on Unicode Arabic text, but not on Arabizi.
>
> 2) You're trying to detect code switching in text, e.g. the occasional use
> of English or French loanwords or other terms inside Arabic text, where the
> loanwords are written in the same script as the Arabic.
>
> Problem (2) is made worse by the fact that many such English-in-Arabic
> words won't be in dictionaries (even assuming you try to map the Arabic
> script in a fuzzy way to roman script), because they're place names or
> person names.
>
> There is also a problem of deciding whether a word *is* Arabic, English or
> French; I've heard the Arabic word 'mufti' used in English in ways that I'm
> guessing it wouldn't be used in Arabic.  Does it count as an Arabic word,
> or as English?  And if you think that one is clear (perhaps because its
> meaning in English is so divergent from its meaning in Arabic), then there
> are other, more borderline, examples.  (The same problem arises with place
> names; is 'Cairo' an Arabic word in English text, just because it refers to
> a place in the Arabic-speaking world?)  Of course, whether this sort of
> thing is a problem depends on what use you want to put the results to.
> Maybe place names don't matter for your purposes.
>
> There are doubtless papers, even books, written on these issues.  (And
> there was a discussion on this list a while back about it.)  Of course, if
> your task is to decide whether paragraph-sized stretches of text are
> (mostly) English, French or Arabic, then the usual language ID programs
> will work just fine.
>
>    Mike Maxwell
>    University of Maryland
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
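
For readers who have not seen one, the character n-gram approach to language
ID that Mike mentions in the quoted message can be sketched roughly as below.
This is only an illustration under stated assumptions: the three training
sentences, the names TRAINING, char_ngrams, cosine and identify, and the
choice of trigrams with cosine similarity are all hypothetical and are not
taken from any particular language ID program. A real system would be trained
on far more text.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny training samples (assumption: a real
# system would use large corpora of the relevant text type).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and this is the way that language works",
    "fr": "le renard brun saute par dessus le chien paresseux et c'est ainsi que la langue fonctionne",
    "ar": "اللغة العربية هي لغة جميلة ويتحدث بها كثير من الناس في العالم",
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding the text with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One trigram profile per language, built from the samples above.
PROFILES = {lang: char_ngrams(sample) for lang, sample in TRAINING.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sample in ["this is the language", "le chien et la langue", "هذه لغة عربية"]:
        print(sample, "->", identify(sample))
```

A sketch like this also makes Mike's two failure cases easy to see. Text in a
form the profiles were never built from (Arabizi, say) shares almost no
trigrams with the Arabic-script profile, and a single borrowed word written in
Arabic script contributes only a handful of trigrams, all of which look
Arabic to the classifier.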

-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO