6.1264, Sum: Languages With No Between-word Delimiters
The Linguist List
linguist at tam2000.tamu.edu
Sun Sep 17 15:27:37 UTC 1995
---------------------------------------------------------------------------
LINGUIST List: Vol-6-1264. Sun Sep 17 1995. ISSN: 1068-4875. Lines: 143
Subject: 6.1264, Sum: Languages With No Between-word Delimiters
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
Associate Editor: Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
Ann Dizdar <dizdar at tam2000.tamu.edu>
Annemarie Valdez <avaldez at emunix.emich.edu>
Software development: John H. Remmers <remmers at emunix.emich.edu>
Editor for this issue: lveselin at emunix.emich.edu (Ljuba Veselinova)
---------------------------------Directory-----------------------------------
1)
Date: Sun, 17 Sep 1995 08:02:00 EDT
From: fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject: Prelim.Summary: languages with no between-word delimiters
---------------------------------Messages------------------------------------
1)
Date: Sun, 17 Sep 1995 08:02:00 EDT
From: fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject: Prelim.Summary: languages with no between-word delimiters
Dear people in LINGUIST and NLPASIA-L,
In LINGUIST List (Vol-6-1244, Thu 9/14/95), I submitted the following
question:
>> I want to make a list of languages to classify if a (written) language
>> uses a between-word delimiter (e.g., space in English), or not.
>> That is, if it doesn't have such delimiters, we need to segment
>> for the language processing (by human or computer).
>>
>> You can tell me:
>> 1) Name of the language,
>> 2) Segmentation - Need or No Need,
>> 3) Letters - Use Alphabets (as a group) or not. Or, other graphic
>> group (Cyrillic, Chinese characters, or Own special, etc.).
>> No detail.
>> 4) Note - If you like, short comment.
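To illustrate the distinction drawn in the question above: when a script
marks word boundaries, a program can simply split on the delimiter; when it
does not, some segmentation step is needed first. The following is only a
minimal Python sketch, not any respondent's method. It assumes a small
hypothetical lexicon (LEXICON) and uses greedy longest-match segmentation,
which is just one of several possible techniques. English with the spaces
removed stands in for an undelimited script, and the function names
split_on_delimiter and greedy_segment are made up for this illustration.

```python
# Hypothetical toy lexicon; a real system would need a large dictionary.
LEXICON = {"the", "cat", "sat", "on", "mat"}


def split_on_delimiter(text):
    """Delimited script: words are separated by whitespace."""
    return text.split()


def greedy_segment(text, lexicon):
    """Undelimited script: greedy longest-match against a lexicon.

    At each position, take the longest lexicon word that starts there.
    If no word matches, emit a single character so the loop always advances.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words


print(split_on_delimiter("the cat sat on the mat"))
print(greedy_segment("thecatsatonthemat", LEXICON))
# both print ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Greedy matching can pick the wrong split when a longer lexicon word
overlaps the start of the next word, so it should be read as an illustration
of why segmentation is a real processing step, not as a recommended method.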
This is a first, preliminary summary of the responses to this inquiry. I have
also included some of my own data.
So far I have received 12 responses from the following people, and I would
like to thank them all.
From: Shanley Allen <allen at mpi.nl>
From: Philippe Mennecier <ferry at cimrs1.mnhn.fr>
From: Stavros Macrakis <macrakis at osf.org>
From: Dan I. Slobin <slobin at cogsci.Berkeley.EDU>
From: Boris Fridman Mintz <fridman at ucol.mx>
From: Allan C Wechsler <Wechsler at world.std.com>
From: Wolfram Kahl <kahl at hermes.informatik.unibw-muenchen.de>
From: Stefan Frisch <frisch at babel.ling.nwu.edu>
From: Doug Cooper <doug at chulkn.car.chula.ac.th>
From: Nicholas Ostler <nostler at chibcha.demon.co.uk>
From: Steve Seegmiller <SEEGMILLER at apollo.montclair.edu>
From: Duncan MacGregor <aa735 at freenet.carleton.ca>
First, here is the list of languages, grouped by whether the written language
has a delimiter symbol for the 'word' boundary, like the blank space between
words in English:
Q: Does the language have word-boundary delimiters?
[YES]: Inuktitut (Eskimo), Amharic, Cherokee(?), Arabic(??),
Hebrew (Modern), Yiddish (Judeo-German), Ladino (Judeo-Spanish)
[NO]: Sanskrit, Thai, Lao, Khmer, Burmese(?), Tibetan, Mongolian(?),
Manchu(?), Japanese, Chinese, Korean(?)
Here, I excluded historical/classical/medieval/extinct languages because
they are not the concern of this survey.
I hope I have not misunderstood what the respondents wrote. If you find a
mistake, or can clarify a (?)-item in this list, please send me a message.
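For anyone who wants to keep track of the list above by program, here is a
minimal Python sketch of one way to record it. The Entry class, its field
names, and the two helper functions are assumptions made for this
illustration only. The entries are copied from the YES/NO list above, with
(?) and (??) both recorded as "uncertain". The script field is left empty
because this summary does not report a letter type for each language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    language: str
    has_delimiter: bool
    uncertain: bool = False       # marked (?) or (??) in the summary
    script: Optional[str] = None  # letter type; not reported per language


ENTRIES = [
    Entry("Inuktitut", True), Entry("Amharic", True),
    Entry("Cherokee", True, uncertain=True),
    Entry("Arabic", True, uncertain=True),
    Entry("Hebrew (Modern)", True), Entry("Yiddish", True),
    Entry("Ladino", True),
    Entry("Sanskrit", False), Entry("Thai", False), Entry("Lao", False),
    Entry("Khmer", False), Entry("Burmese", False, uncertain=True),
    Entry("Tibetan", False), Entry("Mongolian", False, uncertain=True),
    Entry("Manchu", False, uncertain=True), Entry("Japanese", False),
    Entry("Chinese", False), Entry("Korean", False, uncertain=True),
]


def group_by_delimiter(entries):
    """Group language names under 'YES' (has delimiters) or 'NO'."""
    groups = {"YES": [], "NO": []}
    for e in entries:
        groups["YES" if e.has_delimiter else "NO"].append(e.language)
    return groups


def needs_clarification(entries):
    """Return the languages whose classification is still uncertain."""
    return [e.language for e in entries if e.uncertain]


groups = group_by_delimiter(ENTRIES)
print(len(groups["YES"]), len(groups["NO"]))  # prints: 7 11
print(needs_clarification(ENTRIES))
```

A correction from a respondent then amounts to changing one entry and
clearing its uncertain flag.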
(Hereafter I will call a 'word-boundary delimiter' simply a 'delimiter'.
Several respondents commented on confusing terminology, such as
"segmentation" or "separation"; "spaces/blanks", "punctuations",
"word breaks" or "delimiters". In particular, "segmented" can mean either
"the text is 'segmented' as is" or "the text must be 'segmented' to separate
words". I will restate my question at the end of this message.)
So far, at least, I have not seen a counter-example to my guess, namely that
most Asian languages have no delimiters to separate words, regardless of
whether their script is phonetic or ideographic/logographic (except for
languages written in Romanized characters).
Obviously we do not yet have enough data to cover many of the typological
language families. I would like to see data on more languages, and I welcome
your further contributions, especially for the languages listed at the end of
this message.
I received several valuable comments, such as:
1) According to Doug Cooper, there are Indian languages which "are segmented,
while others, of similar origin, are not". If so, this implies that the
script a language uses does not by itself determine whether it has
delimiters.
2) Despite Cooper's observation above, "it is probably safe to say that
all modern languages that use a Latin-, Cyrillic, or Greek-based writing
system use a blank space as a delimiter" according to Steve Seegmiller.
I counted how many languages use a Latin-, Cyrillic-, or Greek-based script,
using the data in Campbell's Concise Compendium of the World's Languages
(1995), across 96 languages. The result was that 63% (61 languages) use one
of these three types. This sample is not typologically balanced, since it is
based on the population of speakers; in any case, the establishment of an
orthography is very much a product of religion or cultural politics in
history.
The following are non-Latin/Cyrillic/Greek-based *modern* languages for which
I still have no data:
Armenian(modern), Assamese, Bengali,
Buginese, Georgian, Hindi,
Kannada, Kashmiri, Kurdish,
Lahnda, Malayalam, Marathi,
Nepali, Panjabi, Pashto,
Persian, Sinhalese, Sundanese,
Tamil, Telugu, Urdu,
Uzbek
Please send your response directly to me, so that I can later submit the
final summary to LINGUIST and NLPASIA-L. You can tell me:
1) Name of the language,
2) Whether the language has word-boundary delimiters or not,
3) Letter Type: Roman/Greek, Cyrillic, Arabic, Devanagari,
Hebrew, Chinese, or other group
4) Note - If you like, short comment.
I appreciate your contribution.
- Hideo Fujii (fujii at cs.umass.edu)
University of Massachusetts
------------------------------------------------------------------------
LINGUIST List: Vol-6-1264.