7.950, Qs: Company names, Wide-character, Dutch dialects

The Linguist List linguist at tam2000.tamu.edu
Sat Jun 29 14:44:21 UTC 1996


---------------------------------------------------------------------------
LINGUIST List:  Vol-7-950. Sat Jun 29 1996. ISSN: 1068-4875. Lines:  176
 
Subject: 7.950, Qs: Company names, Wide-character, Dutch dialects
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu> (On Leave)
            T. Daniel Seely: Eastern Michigan U. <dseely at emunix.emich.edu>
 
Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: dseely at emunix.emich.edu (T. Daniel Seely)
 
We'd like to remind readers that the responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST; so we
would appreciate your cooperating with it whenever it seems appropriate.
 
---------------------------------Directory-----------------------------------
1)
Date:  Fri, 28 Jun 1996 10:13:01 +0200
From:  lena at sol.promotor.telia.se (Lena Santamarta)
Subject:  QS; company names & acronyms
 
2)
Date:  Thu, 27 Jun 1996 17:41:10 CDT
From:  Mark at dragonsys.com (Mark Mandel)
Subject:   processing mixed ASCII and wide-character text files
 
3)
Date:  Sat, 29 Jun 1996 14:04:13 GMT
From:  HG4 at soas.ac.uk (HOWARD GREGORY)
Subject:        Dutch dialects (again)
 
---------------------------------Messages------------------------------------
1)
Date:  Fri, 28 Jun 1996 10:13:01 +0200
From:  lena at sol.promotor.telia.se (Lena Santamarta)
Subject:  QS; company names & acronyms
 
 
I'm analysing Swedish company names.
The aim is to improve the pronunciation of such names in a TTS system.
I'm looking for research results on the structure and syntax of
company names and trademarks. I would also appreciate information about any
kind of research on acronyms.
I promise to send a summary to the list.
 
Thanks!
 
Lena Santamarta
lena at sol.promotor.telia.se
Lena.X.Santamarta at Telia.se
Telia Promotor AB
------------------------------------------------------------------------
2)
Date:  Thu, 27 Jun 1996 17:41:10 CDT
From:  Mark at dragonsys.com (Mark Mandel)
Subject:   processing mixed ASCII and wide-character text files
 
Many writing systems are represented in computer form with 16-bit
("wide") characters, and these often include what I'll call high-
bit bytes: bytes with numerical values of 128 or greater. But many
environments, such as basic email, don't transmit high-bit bytes
reliably; and many programs, including many in-house tools, do not
deal readily with high-bit data, or would need major rewriting to
accommodate it. Certainly there are MIME and UUENCODE, but these
(1) are not available to everyone and (2) apply to whole messages
or chunks of message.
 
Think of a list of words in language X (one of several possible
languages, some not yet determined), with English glosses. There
is a standard coding for the characters of language X, and it uses
high-bit wide characters. There are also several differently-
structured types of list, some of which have language X only in
the first word of each line, but others of which have other
arrangements or may even be random. Language X was input in the
standard coding and will eventually be output that way, but the
processing requires all the data in the file to be in ASCII bytes
(numerically between 1 and 127 inclusive, and preferably avoiding
control characters). The processing may need to detect and
manipulate individual characters of language X, represented as
substrings of 7-bit byte strings, and will definitely need to
manipulate the English at the character level.
 
I would like to have a general encoding system that will
    1. read such an input file,
    2. convert strings of wide, possibly high-bit, characters to
strings of 7-bit bytes in which the wide characters remain
distinguishable in converted form, without changing the English
and other ASCII text already in the file,
    3. later take the output of all the processing and return the
converted strings to the standard high-bit form,
    4. all without reliance on a particular code set.
Assume that the processing programs are smart enough not to mess
up the converted text: if they modify any such text, it will
reconvert correctly to a (modified) string of wide characters.
Similarly, nothing will be introduced in the original-ASCII text
that could be mistaken for converted wide-character text.
 
Such a system would not need to know anything about the characters
of language X, or even know the difference between coded Russian
and coded Japanese, only how to biuniquely convert wide-character
strings to ASCII and back again, without erroneously converting
original ASCII strings (such as English text) to wide-character
strings during reconversion. There may be special escape
characters or sequences to signal the beginning and end of
converted strings, but these by preference should be required only
in the converted form, not the wide-character standard form.
 
Obviously the converted strings cannot be understood as specific
(strings of) displayed characters, such as Cyrillic capital shcha
or Mandarin shi4 (~= 'be'), without knowing the language and code
set of each string; but assume that each file contains only
English and one other (variable) language, which is known for each
file. Also assume that character representation must remain
constant, so that capital shcha is represented by the same ASCII
substring wherever it occurs in its string.
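
The requirements above can be sketched roughly as follows. This is only
a minimal illustration under stated assumptions, not an existing
standard or tool: it assumes that every wide character is a two-byte
pair whose first byte has the high bit set, that ASCII text is single
bytes below 128, and that the escape markers "{#" and "#}" (names
chosen here purely for illustration) never occur in the original file.
Each wide character becomes a fixed four-digit hexadecimal substring,
so its representation is constant wherever it occurs.

```python
import re

# Hypothetical escape markers, chosen for this sketch only.
OPEN, CLOSE = b"{#", b"#}"


def to_ascii(data: bytes) -> bytes:
    """Replace each run of two-byte wide characters with OPEN + hex + CLOSE."""
    # Conservative check: refuse any input that already contains the opener.
    if OPEN in data:
        raise ValueError("input already contains the escape marker")
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Plain ASCII byte: copy it through unchanged.
            out.append(data[i])
            i += 1
            continue
        out += OPEN
        # Assumption: a byte >= 0x80 starts a two-byte wide character.
        while i < len(data) and data[i] >= 0x80:
            if i + 1 >= len(data):
                raise ValueError("truncated wide character at end of input")
            out += data[i:i + 2].hex().upper().encode("ascii")
            i += 2
        out += CLOSE
    return bytes(out)


# One or more groups of four uppercase hex digits between the markers.
_ESCAPED = re.compile(
    re.escape(OPEN) + rb"((?:[0-9A-F]{4})+)" + re.escape(CLOSE)
)


def from_ascii(data: bytes) -> bytes:
    """Turn every escaped run back into the original wide-character bytes."""
    return _ESCAPED.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), data
    )


if __name__ == "__main__":
    # Arbitrary byte pairs stand in for wide characters of language X.
    sample = b"\xd0\xa9 shcha, \xca\xc7\xb5\xc4 be\n"
    converted = to_ascii(sample)
    print(converted)
    assert from_ascii(converted) == sample
```

The converted form uses only printable ASCII for the wide-character
runs, and the markers appear only in the converted form, never in the
wide-character original. A code set in which some wide characters begin
with a low byte would need a different rule for finding character
boundaries.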
 
It is simple to write such a program. But if it already exists, or
if there's a standard encoding algorithm that satisfies these
requirements and which we can use, it would be silly to reinvent
the wheel.
 
                Mark A. Mandel : mark at dragonsys.com
    Dragon Systems, Inc. : speech recognition : +1 617 965-5200
 320 Nevada St., Newton, MA 02160, USA : http://www.dragonsys.com/
 => KLINGON PAGE: http://www.dragonsys.com/klingon/klingon.html <=
 
 
------------------------------------------------------------------------
3)
Date:  Sat, 29 Jun 1996 14:04:13 GMT
From:  HG4 at soas.ac.uk (HOWARD GREGORY)
Subject:        Dutch dialects (again)
 
I have received several suggestions that I should expand on my query
of a few days ago.
 
I have been following up references in the literature on Dutch
concerning the following types of sentences:
    1)  ... dat Jan de meisjes zijn bevallen
    2)  ... dat Marie de bloemen werden gegeven
... and their main-clause equivalents. They all have the Indirect
Object (IO) before the other NP.
 
According to what I had read (Koster (1978), Hoekstra (1984), Den
Besten (1985) inter alia), I was expecting to find the following
phenomena:
    1)  The order given is the unmarked way of expressing these
sentences, i.e. they are not a case of contrastive topicalization.
    2)  The IO when fronted in this way is able to control the
Subject of an infinitival adjunct clause (e.g. "na ... terug gekeerd
te zijn", from Hoekstra (1984)).
    3)  Some speakers allow the IO to control verb agreement
(especially in main clauses).
 
Data gathered from several subscribers to the list, who kindly
offered to help me out, did not seem to offer much support for any of
these statements, though they seem reasonably well established in the
literature. I am therefore wondering whether this is a case of
dialectal variation (especially as I have noticed that most of my
informants have addresses in Belgium), or whether the data are simply
controversial. I will be grateful if anybody can shed light on this.
 
Best wishes,
Howard Gregory
hg4 at soas.ac.uk
------------------------------------------------------------------------
LINGUIST List: Vol-7-950.


