8.797, Qs: Lang id, Student corpus, Syntax

Thu May 29 17:15:06 UTC 1997

LINGUIST List:  Vol-8-797. Thu May 29 1997. ISSN: 1068-4875.

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            T. Daniel Seely: Eastern Michigan U. <seely at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Associate Editors: Ljuba Veselinova <ljuba at linguistlist.org>
                   Ann Dizdar <ann at linguistlist.org>
Assistant Editor:  Sue Robinson <sue at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

Editor for this issue: Ann Dizdar <ann at linguistlist.org>
 ==========================================================================

We'd like to remind readers that responses to queries are usually
best posted to the individual asking the question. That individual is
then strongly encouraged to post a summary to the list. This policy was
instituted to help control the huge volume of mail on LINGUIST, so we
would appreciate your cooperation with it whenever it seems appropriate.

=================================Directory=================================

1)
Date:  Tue, 27 May 1997 13:52:58 -0500
From:  Mark Mandel <Mark at dragonsys.com>
Subject:   Language identification

2)
Date:  Wed, 28 May 1997 01:28:30 +0800
From:  colber at mbm1.scu.edu.tw
Subject:  student corpus - advice sought

3)
Date:  Thu, 1 May 1997 19:34:33 +0900
From:  htanaka at osk.threewebnet.or.jp (Hiroyuki TANAKA)
Subject:  syntax papers

-------------------------------- Message 1 -------------------------------

Date:  Tue, 27 May 1997 13:52:58 -0500
From:  Mark Mandel <Mark at dragonsys.com>
Subject:   Language identification

An acquaintance of my daughter's writes:

 ===================================

 Identify this language please?

"Idolem urodo iatu a wi rot
 Ukufu kush onuoy nehawuoch
 Etia di ukoik ura nakurah
 Enadu yoimi nnesar urugem
 Eteako ich atak
 Ureatu tso oodah
 Amia wibo koro yonneie"

I think I have a pretty good idea of what languages this is *not* (not
a Romance language, not Germanic, not Slavic, not Chinese, Japanese,
Vietnamese...).  Also, if it translates to something really corny,
lemme know so I can stop embarrassing myself every time I sing it.

 ===================================

Please respond to me. I will forward replies to the inquirer and
summarize to the list. Thank you for any help.

       Mark A. Mandel : Senior Linguist : mark at dragonsys.com
    Dragon Systems, Inc. : speech recognition : +1 617 965-5200
 320 Nevada St., Newton, MA 02160, USA : http://www.dragonsys.com/
           Personal home page: http://world.std.com/~mam/

-------------------------------- Message 2 -------------------------------

Date:  Wed, 28 May 1997 01:28:30 +0800
From:  colber at mbm1.scu.edu.tw
Subject:  student corpus - advice sought

Has anyone on the List compiled or worked with STUDENT corpora?

I am in the process of putting together a corpus of Chinese college
students' unedited writings in English.  The purpose is to analyze
this corpus with a concordancer and other programs, and to obtain
quantitative information about the extent of some characteristic
errors or other non-native word usages in their writing.  This
information can be very valuable in determining syllabuses and
directions in secondary school English instruction.

The corpus is planned to contain about 300,000 words, consisting of
800-1000 written assignments, each between 150 and 400 words long,
typed and saved as text files.  About one third of these assignments
have already been typed (entered).

So far I have not used any other STUDENT corpus, from any country.  So
my question is: are there any STANDARDS, that is, generally used or
accepted electronic formats, in which such corpora are compiled, saved,
and prepared for use by others?

Below I briefly describe how the corpus is being compiled.  I would be
very grateful for suggestions or comments on whether this procedure is
acceptable or whether any change should be made to comply with
accepted forms.

- Each piece is typed in Word 6.0 (under Windows 3.1), using a
fixed-width font, with each line about 70 characters long.  The text
is otherwise unedited and uncorrected (only obvious spelling and
punctuation mistakes made by the students are corrected).

- An 8-12 character code (number) is typed on the first line.  Then
one line is skipped, and the heading (headline) of the piece, as
written by the student, is typed.

- Paragraphing follows the original, with blank lines between the
paragraphs.

- Before saving the text, possible spelling and other errors made in
the typing process are checked and corrected using Word's spell
checker.

- Then each piece is saved as a "text only with line breaks" file and
given a file name (number).

- All these files are placed in one directory and backed up to prevent
accidental erasure.

- The files are then merged using a simple merge utility.
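
The layout and merge step listed above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
the merge utility actually used for the corpus: the function names
parse_piece and merge_pieces, the "*.txt" file pattern, the latin-1
encoding, and the assumption that the heading occupies a single line
are all hypothetical choices made for the example.

```python
from pathlib import Path


def parse_piece(text):
    """Split one piece into its code, heading, and paragraphs.

    Assumed layout: code on the first line, a blank line, a
    one-line heading, then paragraphs separated by blank lines.
    """
    lines = [line.strip() for line in text.splitlines()]
    if not lines:
        raise ValueError("empty piece")
    code, rest = lines[0], lines[1:]

    # The heading is taken to be the first non-blank line after the code.
    i = 0
    while i < len(rest) and not rest[i]:
        i += 1
    heading = rest[i] if i < len(rest) else ""

    # Remaining lines are grouped into paragraphs at blank lines,
    # joining the hard line breaks inside each paragraph.
    paragraphs, current = [], []
    for line in rest[i + 1:]:
        if line:
            current.append(line)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return {"code": code, "heading": heading, "paragraphs": paragraphs}


def merge_pieces(directory, pattern="*.txt", encoding="latin-1"):
    """Concatenate all pieces in file-name order, one blank line apart."""
    parts = [
        path.read_text(encoding=encoding).strip()
        for path in sorted(Path(directory).glob(pattern))
    ]
    return "\n\n".join(parts) + "\n"
```

Because every piece starts with its code line, the merged file can
still be split back into individual pieces later, provided the codes
are distinguishable from ordinary text.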

So far, I have tried loading a consolidated file of about 350 pieces
of writing (about 120,000 words) into a concordancer (WordSmith
Tools), and there seem to be no problems.  Would files compiled this
way ALSO be USABLE in other concordancers or text processing/analysis
programs?
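
As a rough illustration of what such a program needs from the text,
here is a minimal keyword-in-context (KWIC) sketch in Python. It is
not how WordSmith Tools or any other concordancer works internally;
the function name kwic and the default context width are hypothetical.
It assumes only plain text with hard line breaks, as described above.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context row per whole-word match."""
    # Collapse hard line breaks and repeated spaces into single spaces.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up.
        rows.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return rows


if __name__ == "__main__":
    sample = "I am agree with\nthis idea. She is agree too."
    for row in kwic(sample, "agree", width=15):
        print(row)
```

The sketch joins hard line breaks before searching, so line-broken
text of this kind poses no obstacle to this sort of processing.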

Please send your comments either to the List or to me.  I will
certainly summarize the communications sent to me and send the
summary to the List.

I would also be very happy eventually to make this corpus available
to anyone interested in using it, or to exchange it for similar
learner corpora based on the writings of other Chinese or Japanese
students, or of English-learning college students in any country.

Best to all,

Colman Bernath

-----------------
Colman Bernath
c/o Department of English
Soochow University, Taipei, TAIWAN
colber at mbm1.scu.edu.tw

-------------------------------- Message 3 -------------------------------

Date:  Thu, 1 May 1997 19:34:33 +0900
From:  htanaka at osk.threewebnet.or.jp (Hiroyuki TANAKA)
Subject:  syntax papers

Does anyone have (a published version of) the following two papers,
both of which are cited in L. Rizzi's (1990) _Relativized Minimality_?

  Carstens, V., and Kinyalolo. 1989. Agr, Tense, Aspect and
    the IP Structure: Evidence from Bantu. Paper presented at
    GLOW Conference, Utrecht.

  Schneider-Zioga, P. 1987. Syntax Screening. Paper, USC, Los Angeles.

Please contact me at the address below.

    +------/-----------------------------------/------+
    |                Hiroyuki Tanaka                  |
    |       Department of English Linguistics,        |
    |      Faculty of Letters, Osaka University.   +--+
    |      e-mail: htanaka at osk.threewebnet.or.jp   | /
    +----------------------------------------------+

---------------------------------------------------------------------------
LINGUIST List: Vol-8-797