6.1264, Sum: Languages With No Between-word Delimiters

Sun Sep 17 15:27:37 UTC 1995

---------------------------------------------------------------------------
LINGUIST List:  Vol-6-1264. Sun Sep 17 1995. ISSN: 1068-4875. Lines:  143

Subject: 6.1264, Sum: Languages With No Between-word Delimiters

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>

Software development: John H. Remmers <remmers at emunix.emich.edu>

Editor for this issue: lveselin at emunix.emich.edu (Ljuba Veselinova)

---------------------------------Directory-----------------------------------
1)
Date:  Sun, 17 Sep 1995 08:02:00 EDT
From:  fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject:  Prelim.Summary: languages with no between-word delimiters

---------------------------------Messages------------------------------------
1)
Date:  Sun, 17 Sep 1995 08:02:00 EDT
From:  fujii at mackay.cs.umass.edu (Hideo Fujii)
Subject:  Prelim.Summary: languages with no between-word delimiters

Dear people in LINGUIST and NLPASIA-L,

In LINGUIST List (Vol-6-1244, Thu 9/14/95), I submitted the following
question:

>>   I want to make a list of languages to classify if a (written) language
>>   uses a between-word delimiter (e.g., space in English), or not.
>>   That is, if it doesn't have such delimiters, we need to segment
>>   for the language processing (by human or computer).
>>
>>   You can tell me:
>>	1) Name of the language,
>>	2) Segmentation - Need or No Need,
>>	3) Letters - Use Alphabets (as a group) or not. Or, other graphic
>>          group (Cyrillic, Chinese characters, or Own special, etc.).
>>          No detail.
>>	4) Note - If you like, short comment.

This is a first, preliminary summary of the responses to this inquiry.
I've also included some of my own data.

So far I've received 12 responses from the following people.  I would
like to thank them all.
	From: Shanley Allen <allen at mpi.nl>
	From: Philippe Mennecier <ferry at cimrs1.mnhn.fr>
	From: Stavros Macrakis <macrakis at osf.org>
	From: Dan I. Slobin <slobin at cogsci.Berkeley.EDU>
	From: Boris Fridman Mintz <fridman at ucol.mx>
	From: Allan C Wechsler <Wechsler at world.std.com>
	From: Wolfram Kahl <kahl at hermes.informatik.unibw-muenchen.de>
	From: Stefan Frisch <frisch at babel.ling.nwu.edu>
	From: Doug Cooper <doug at chulkn.car.chula.ac.th>
	From: Nicholas Ostler <nostler at chibcha.demon.co.uk>
	From: Steve Seegmiller <SEEGMILLER at apollo.montclair.edu>
	From: Duncan MacGregor <aa735 at freenet.carleton.ca>

First, here is the list of languages, grouped by whether the written language
has a delimiter symbol for the 'word' boundary in text, like the blank space
between words in English:

Q: Does the language have word-boundary delimiters?
  [YES]: Inuktitut(Eskimo), Amharic, Cherokee(?), Arabic(??),
	 Hebrew(Modern), Yiddish(Judeo-German), Ladino(Judeo-Spanish)

  [NO]:  Sanskrit, Thai, Lao, Khmer, Burmese(?), Tibetan, Mongolian(?),
	 Manchu(?), Japanese, Chinese, Korean(?)

Here, I excluded historical/classical/medieval/extinct languages because
they are not a concern of this survey.

I hope I didn't misunderstand what the respondents wrote.  If you find a
mistake, or can clarify a (?)-marked item in this list, please send me a
message.

(Hereafter I will call a 'word-boundary delimiter' simply a 'delimiter'.
 Some respondents commented on the confusing terminology: "segmentation"
 or "separation"; "spaces/blanks", "punctuation", "word breaks" or
 "delimiters"; and "segmented", which can mean either "the text is
 'segmented' as is" or "the text must be 'segmented' to separate words".
 I restate my question at the end of this message.)

So far, at least, I haven't seen a counter-example to my guess, i.e., that
most Asian languages don't use delimiters to separate words, regardless of
whether their script is phonetic or ideo/logographic (except for
languages written in Romanized characters).
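
To make concrete what "needing segmentation" means for computer processing,
here is a minimal sketch in Python.  It is only an illustration, not
something taken from the responses: the function name and the two sample
sentences are assumptions chosen for this example.  It shows that splitting
on blank spaces recovers the words of an English sentence, while a Japanese
sentence written without delimiters comes back as one unsegmented string.

```python
def whitespace_tokens(text):
    """Naive word split: treat runs of whitespace as the only delimiter."""
    return text.split()


# Illustrative sample sentences (assumed for this sketch).
english = "I read a book"
japanese = "私は本を読みます"  # roughly "I read a book", written with no spaces

# English: the blank spaces already mark the word boundaries.
print(whitespace_tokens(english))

# Japanese: no delimiter is present, so the whole sentence is one "token"
# and a separate segmentation step would be needed to find the words.
print(whitespace_tokens(japanese))
```

For a language in the [NO] group, some further step (a dictionary, rules, or
a human reader) has to supply the word boundaries that the spaces supply in
English.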

Obviously we don't have enough data to cover many of the typological language
families.  I would like to see data on more languages, and I welcome further
contributions, especially for the languages listed at the end of this message.

I received several valuable comments, such as:

1) According to Doug Cooper, there are Indian languages which "are segmented,
while others, of similar origin, are not".  If so, this implies that a
language's script is not a decisive factor in whether it has delimiters.

2) Despite Cooper's observation above, "it is probably safe to say that
all modern languages that use a Latin-, Cyrillic, or Greek-based writing
system use a blank space as a delimiter" according to Steve Seegmiller.

I've counted the frequency of languages with Latin-, Cyrillic-, or Greek-based
scripts, using the data in Campbell's Concise Compendium of the World's
Languages (1995), across 96 languages.  The result was that 63% (61 languages)
use one of these three types.  This sample is not typologically balanced,
since it is based on the population of speakers; in any case, the
establishment of an orthography is very much a product of religion or
cultural politics in history.

The following are non-Latin/Cyrillic/Greek-based *modern* languages for which
I still don't have data:

  Armenian(modern),	Assamese, 		Bengali,
  Buginese, 		Georgian, 		Hindi,
  Kannada, 		Kashmiri, 		Kurdish,
  Lahnda, 		Malayalam, 		Marathi,
  Nepali, 		Panjabi, 		Pashto,
  Persian, 		Sinhalese, 		Sundanese,
  Tamil, 		Telugu,  		Urdu,
  Uzbek

Please send your response directly to me, so that I can submit the
final summary to LINGUIST/NLPASIA-L later.  You can tell me:

	1) Name of the language,
	2) Whether the language has word-boundary delimiters or not,
	3) Letter Type: Roman/Greek, Cyrillic, Arabic, Devanagari,
            Hebrew, Chinese, or other group,
	4) Note - If you like, short comment.

I appreciate your contribution.

- Hideo Fujii  (fujii at cs.umass.edu)
  University of Massachusetts