Corpora: Re: Arabic vs Spanish diacritics
Steven Krauwer
Steven.Krauwer at let.uu.nl
Mon Apr 23 22:23:46 UTC 2001
Tim Buckwalter wrote:
> The big difference between Arabic and accented languages such as Spanish
> in this regard is that accent-less Spanish is probably sub-standard or
> at least informal orthography. Whereas it is the norm for an entire
> formal Arabic newspaper to have only a dozen or so thoughtfully-placed
> short vowels & diacritics, an unaccented Spanish newspaper would be hard
> to imagine (I've never seen one, at least), or one with accents placed
> only where there is not enough context to know what is intended.
So, the picture is (in a very black and white version): the
Spanish have fewer diacritics (both types and tokens) but use
them virtually all the time, and the Arabs have a lot more of
them, but they hardly ever use them.
I have three questions:
- does this difference have any measurable effect on the
learning process (for native speakers who learn to read
and write)?
- does it have any such effect on parsing and processing by humans?
- does it have any such effect on NLP?
Any pointers to any empirical data?
I realize that we are now really moving away from this list's
core business, so I'll be happy to continue this discussion
somewhere else if people prefer that.
[ One place to go could be the email list
elsnet-arabic at elsnet.org
that we have just set up for discussing Arabic NLP and Speech
processing issues, but that hasn't been officially launched
yet. Subscription is already open at
http://utrecht.elsnet.org/subscriptions.html ]
Steven