6.1264, Sum: Languages With No Between-word Delimiters

The Linguist List linguist at tam2000.tamu.edu
Sun Sep 17 15:27:37 UTC 1995


---------------------------------------------------------------------------
LINGUIST List:  Vol-6-1264. Sun Sep 17 1995. ISSN: 1068-4875. Lines:  143
 
Subject: 6.1264, Sum: Languages With No Between-word Delimiters
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>
 
Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: lveselin at emunix.emich.edu (Ljuba Veselinova)
 
---------------------------------Directory-----------------------------------
1)
Date:  Sun, 17 Sep 1995 08:02:00 EDT
From:  fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject:  Prelim.Summary: languages with no between-word delimiters
 
---------------------------------Messages------------------------------------
1)
Date:  Sun, 17 Sep 1995 08:02:00 EDT
From:  fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject:  Prelim.Summary: languages with no between-word delimiters
 
 
Dear people in LINGUIST and NLPASIA-L,
 
In LINGUIST List (Vol-6-1244, Thu 9/14/95), I submitted the following
question:
 
>>   I want to make a list of languages to classify if a (written) language
>>   uses a between-word delimiter (e.g., space in English), or not.
>>   That is, if it doesn't have such delimiters, we need to segment
>>   for the language processing (by human or computer).
>>
>>   You can tell me:
>>	1) Name of the language,
>>	2) Segmentation - Need or No Need,
>>	3) Letters - Use Alphabets (as a group) or not. Or, other graphic
>>          group (Cyrillic, Chinese characters, or Own special, etc.).
>>          No detail.
>>	4) Note - If you like, short comment.
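
To make "segment" concrete, here is a minimal Python sketch of greedy
longest-match segmentation, one simple dictionary-based approach among many.
The function name `segment` and the toy lexicon are hypothetical, and English
with its spaces removed merely stands in for text written without delimiters.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of text that has no delimiters."""
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; if nothing in the lexicon
        # matches, fall back to emitting a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


# Hypothetical toy lexicon (an assumption made for this illustration).
lexicon = {"the", "them", "cat", "sat", "on", "mat"}

print(segment("thecatsat", lexicon))
# ['the', 'cat', 'sat']
print(segment("thecatsatonthemat", lexicon))
# ['the', 'cat', 'sat', 'on', 'them', 'a', 't']
```

The second call shows why segmentation is not trivial: the greedy choice of
"them" over "the" leaves "at" unmatched, so even a small lexicon can produce
a wrong split when word boundaries are not written.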
 
This is a (first) preliminary summary for this inquiry.  I've also included
some of my own data.
 
So far I've received 12 responses, from the following people.  I would like
to thank all of them.
	From: Shanley Allen <allen at mpi.nl>
	From: Philippe Mennecier <ferry at cimrs1.mnhn.fr>
	From: Stavros Macrakis <macrakis at osf.org>
	From: Dan I. Slobin <slobin at cogsci.Berkeley.EDU>
	From: Boris Fridman Mintz <fridman at ucol.mx>
	From: Allan C Wechsler <Wechsler at world.std.com>
	From: Wolfram Kahl <kahl at hermes.informatik.unibw-muenchen.de>
	From: Stefan Frisch <frisch at babel.ling.nwu.edu>
	From: Doug Cooper <doug at chulkn.car.chula.ac.th>
	From: Nicholas Ostler <nostler at chibcha.demon.co.uk>
	From: Steve Seegmiller <SEEGMILLER at apollo.montclair.edu>
	From: Duncan MacGregor <aa735 at freenet.carleton.ca>
 
First, here is the list of languages, grouped by whether or not the written
language has a delimiter symbol marking 'word' boundaries in text, like the
blank space between words in English:
 
Q: Does the language have word-boundary delimiters?
  [YES]: Inuktitut(Eskimo), Amharic, Cherokee(?), Arabic(??),
	 Hebrew(Modern), Yiddish(Judeo-German), Ladino(Judeo-Spanish)
 
  [NO]:  Sanskrit, Thai, Lao, Khmer, Burmese(?), Tibetan, Mongolian(?),
	 Manchu(?), Japanese, Chinese, Korean(?)
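
For anyone who wants to use this preliminary list in a program, the answers
above can be held in a small lookup table.  This is only a sketch: the names
`HAS_DELIMITER`, `UNCERTAIN` and `needs_segmentation` are hypothetical, and
the (?) items remain unconfirmed.

```python
# Preliminary answers copied from the list above.
# True means the written language marks word boundaries.
HAS_DELIMITER = {
    "Inuktitut": True, "Amharic": True, "Cherokee": True, "Arabic": True,
    "Hebrew": True, "Yiddish": True, "Ladino": True,
    "Sanskrit": False, "Thai": False, "Lao": False, "Khmer": False,
    "Burmese": False, "Tibetan": False, "Mongolian": False,
    "Manchu": False, "Japanese": False, "Chinese": False, "Korean": False,
}

# Items marked (?) or (??) in the list; treat these answers with caution.
UNCERTAIN = {"Cherokee", "Arabic", "Burmese", "Mongolian", "Manchu", "Korean"}


def needs_segmentation(language):
    """Return True or False, or None if the language is not in the list."""
    has_delimiter = HAS_DELIMITER.get(language)
    return None if has_delimiter is None else not has_delimiter


print(needs_segmentation("Thai"))     # True
print(needs_segmentation("Amharic"))  # False
print(needs_segmentation("Hindi"))    # None (no data yet)
```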
 
Here, I excluded historical/classical/medieval/extinct languages because
those are not a concern of this survey.
 
 
I hope I have not misunderstood what the respondents wrote.  If you find a
mistake, or can clarify a (?)-item in this list, please send me a message.
 
(Hereafter I will call a 'word-boundary delimiter' simply a 'delimiter'.
 Some respondents commented on confusing terminology: "segmentation" vs.
 "separation"; "spaces/blanks", "punctuation", "word breaks" or
 "delimiters"; and "segmented", which can mean either "the text is
 'segmented' as is" or "the text must be 'segmented' to separate words".
 I will restate my question at the end of this message.)
 
So far, at least, I have not seen a counter-example to my guess, namely that
most Asian languages do not use delimiters to separate words, regardless of
whether their script is phonetic or ideographic/logographic (the exception
being languages written in Romanized characters).
 
Obviously we do not yet have enough data to cover many of the language
families.  I would like to see data on more languages, and I welcome further
contributions, especially for the languages listed at the end of this message.
 
I received several valuable comments, such as:
 
1) According to Doug Cooper, there are Indian languages which "are segmented,
while others, of similar origin, are not".  If so, it implies that a
language's script does not by itself determine whether it has delimiters.
 
2) Despite Cooper's observation above, "it is probably safe to say that
all modern languages that use a Latin-, Cyrillic, or Greek-based writing
system use a blank space as a delimiter" according to Steve Seegmiller.
 
I've counted the Latin-, Cyrillic-, or Greek-based languages among the
96 languages in Campbell's Concise Compendium of the World's Languages (1995).
The result was that 61 of the 96 languages (about 63.5%) use one of these
three types.  This sample is not typologically balanced, since it is based
on the population of speakers, but in any case the establishment of an
orthography is very much a product of religion or cultural politics in history.
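
The share follows directly from the two counts reported above; as a quick
check (the variable names are only illustrative):

```python
latin_cyrillic_greek = 61  # languages using one of the three script types
total_languages = 96       # languages counted in the compendium

share = latin_cyrillic_greek / total_languages
remaining = total_languages - latin_cyrillic_greek

print(f"{share:.1%}")  # 63.5%
print(remaining)       # 35 languages use some other script
```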
 
The following are the non-Latin/Cyrillic/Greek-based *modern* languages for
which I still have no data:
 
  Armenian(modern),	Assamese, 		Bengali,
  Buginese, 		Georgian, 		Hindi,
  Kannada, 		Kashmiri, 		Kurdish,
  Lahnda, 		Malayalam, 		Marathi,
  Nepali, 		Panjabi, 		Pashto,
  Persian, 		Sinhalese, 		Sundanese,
  Tamil, 		Telugu,  		Urdu,
  Uzbek
 
Please send your response directly to me so that I can later submit the
final summary to LINGUIST and NLPASIA-L.  You can tell me:
 
	1) Name of the language,
	2) Whether the language has word-boundary delimiters,
	3) Letter Type: Roman/Greek, Cyrillic, Arabic, Devanagari,
            Hebrew, Chinese, or other group
	4) Note - If you like, short comment.
 
I appreciate your contribution.
 
- Hideo Fujii  (fujii at cs.umass.edu)
  University of Massachusetts
------------------------------------------------------------------------
LINGUIST List: Vol-6-1264.


