Summary of the database query

Tue May 26 09:42:44 UTC 1998

        Dear colleagues,
some time ago I sent a query about computerized databases created and used for typological research. I received quite a number of database descriptions, as well as several e-mails from people asking to be informed of the results. I would like to thank everybody who took the time to fill out my questionnaire and all those who showed interest in the topic. Below is a summary of the responses received by e-mail, followed by short descriptions of the databases presented at the workshop "Typological Databases", held in Konstanz, Germany, on April 24-25 and funded by the DFG (German Research Council) programme "Language Typology".

Best wishes

Elena Filimonova
University of Konstanz

Once again the query:

What kind of information is collected in your database?
(KIND OF INFORMATION)
How many languages do you cover?
(NUMBER OF LANGUAGES COVERED)
What purpose do you use it for? (PURPOSE)
What kind of software do you use? (SOFTWARE)

-----Typological databases-----

        David Gil (Kuala Lumpur)
KIND OF INFORMATION: A corpus of spontaneous speech specimens which I have been recording for the last several years, with various annotations and grammatical information provided for each specimen.
NUMBER OF LANGUAGES COVERED: Depends on how you count. The database is limited to so-called dialects of Malay / Indonesian, and includes maybe 10-15 of these. But many are not mutually intelligible and should probably be considered separate languages.
PURPOSE: Basic research into the grammar of Malay / Indonesian dialects
SOFTWARE: FileMaker 3.0 for Macintosh.

        Randy J. LaPolla (Hong Kong)
KIND OF INFORMATION: over 100 datapoints on each language, including nominal morphology (both derivational and relational), isomorphic patterns of such morphology (e.g., are ergative and instrumental arguments marked with the same form?), verbal morphology, word order patterns, pronouns, and a few other Sino-Tibetan-specific items relating to classifiers, etc.
NUMBER OF LANGUAGES COVERED: about 150 Sino-Tibetan languages and dialects (some more or less complete depending on the sources).
PURPOSE: long-running project on Sino-Tibetan morphosyntax. Several papers based on this database have been published.
SOFTWARE: Hypercard

        John Hajek (Melbourne)
KIND OF INFORMATION: sound systems of the New Guinea area:
        (1) contrastive phonemes - CC and VV
        (2) allophones for each phoneme
        (3) word-structure
        (4) syllable-structure
        (5) phonotactics
        (6) allophonic/phonological processes
        (7) basic prosodic information
NUMBER OF LANGUAGES COVERED: 250 languages
PURPOSE: When completed, I will be able to ask specific questions such as:
        (1) how many CC, VV, etc....
        (2) max number of CC, VV
        (3) does presence of X result in Y.....
SOFTWARE: 4D (relational database software)

(The original model is the database and analysis used by Ian Maddieson in
Patterns of Sounds (1984), with obvious increase in the range of
information, etc....)
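
A question of the type "does presence of X result in Y" can be sketched roughly as follows. This is only an illustration in Python, not the 4D implementation; the language labels and segment inventories are invented placeholders, and the function names are hypothetical.

```python
# Hypothetical inventories: labels and segments are placeholders, not real data.
inventories = {
    "LangA": {"p", "t", "k", "m", "n", "a", "i", "u"},
    "LangB": {"p", "t", "k", "b", "d", "g", "m", "n", "a", "i", "u", "e", "o"},
    "LangC": {"t", "k", "d", "n", "a", "i"},
}


def languages_with(inventories, segment):
    """List the languages whose inventory contains the given segment."""
    return sorted(lang for lang, segs in inventories.items() if segment in segs)


def implication_holds(inventories, x, y):
    """Check 'if a language has x, it also has y'.

    Returns a pair (holds, counterexamples), where counterexamples are the
    languages that have x but lack y.
    """
    counterexamples = sorted(
        lang for lang, segs in inventories.items() if x in segs and y not in segs
    )
    return (not counterexamples, counterexamples)


print(languages_with(inventories, "p"))
print(implication_holds(inventories, "d", "b"))
```

Returning the counterexamples together with the yes/no answer keeps the exceptions visible, which is usually what one wants to inspect next.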

        Matthew S. Dryer (Buffalo)
NUMBER OF LANGUAGES COVERED:
855 languages in current database, but for many of them only partial
information is coded
PURPOSE: Investigating typological correlations and geographical distributions
SOFTWARE: own software, written for the Macintosh
KIND OF INFORMATION: M. Dryer has sent a very detailed description of his database, which contains 31 datapoints, mostly on word order typology and other fairly basic typological morphosyntactic information. For lack of space it is impossible to quote his entire description here, but these are his datapoints:
        the basic order of nominal subject, nominal object, and verb;
        the basic order of genitive and noun, type of structure used in genitive construction;
        the basic order of attributive (modifying) adjective and noun, information on the word class of adjectives: are they verbs or nouns? do they take a copula?;
        the basic order of adposition and NP;
        the basic order of demonstrative and noun, the basic order of article and noun, the basic order of classifier and noun. Is the definite article the same as the demonstrative? Is the definite article the numeral "one"? Can articles co-occur with demonstratives?
        the basic order of numeral and noun;
        the basic order of relative clause and head noun, structure of relative clause, position of relative marking relative to relative clause;
        the basic order of noun stem and case affix;
        the basic order of manner adverb and verb, position of adverb relative to subject, object, verb;
        the basic position of pronominal subject affixes on the verb, status of pronominal subject affixes with respect to role at clause level, position of independent subject pronouns relative to the verb, status of independent subject pronouns with respect to role at clause level;
        the basic position of pronominal object affixes on the verb;
        the basic position of "auxiliaries" relative to the lexical verb, position of Aux relative to subject, object, verb, categorial status of Aux;
        the basic position of tense/aspect affixes relative to the verb stem;
        the basic position of modality morphemes relative to the lexical verb;
        the basic position of morphemes indicating desire of subject relative to the lexical verb or stem denoting the event wanted;
        the basic position of causative morphemes relative to lexical verb or verb stem;
        the basic position of negative morphemes relative to main/lexical verb or verb stem, position of negative relative to subject, object, and verb, order of negative and tense-aspect auxiliary;
        the basic order of adjective and intensifier;
        the basic order of adjective, marker, and standard in comparatives, the basic order of marker and standard, the basic order of adjective and standard;
        the basic position of question particles or affixes (used to indicate a yes/no question) relative to the main verb or verb stem;
        whether interrogative words in content questions obligatorily occur in sentence-initial position or not;
        the basic position of adverbial subordinators relative to the rest of the subordinate clause, the basic order of adverbial subordinate clause and main clause, the basic order of purpose complement with respect to main verb;
        the basic order of direct objects relative to indirect objects, marking of indirect objects;
        the basic order of complementizer and clause;
        the basic order of copulative morphemes and predicates, the categorial status of copulas. Do adjective predicates occur with a copula? Do locative predicates occur with a copula? Basic order of morpheme meaning "become" relative to predicate;
        the basic order of adpositional phrases relative to the verb, the basic order of oblique nominals (incl adp. phrases) relative to the subject, object, and verb;
        the basic order among modifiers of the noun, including Adj, Num, Dem, Art, and Gen;
        the basic order among verbal affixes;
        basic order of interrogative words meaning "which" and noun, basic order of other modifying interrogative words relative to noun;
        basic order of universal quantifier and noun;
        the basic order of plural words or affixes relative to the noun;
        the basic position of pronominal affixes on possessed nouns indicating person and/or number of possessor, the basic position of possessive words, pronominal words indicating person and/or number of possessor modifying nouns.

        Vladimir A. Plungjan, Igor' A. Shoshitajshvili, V.Ju. Gusev (Moscow)
        Database "Verbum":

KIND OF INFORMATION: description of grammatical categories of Verb.
Each field of the database contains the following information:
(1) Language name.
(2) Category name.
(3) Manifestation of a given category (morphemic or syntactic construction is given).
(4) Type of manifestation of the grammatical category.
(5) Degree of regularity of a given language.
(6) A list of grammatical meanings expressed cumulatively.
(7) A list of grammatical meanings expressed by the same morpheme or construction described excluding the given meaning.
(8) A list of grammatical meanings forbidding the occurrence of a described meaning in the same construction.
(9) Comments.
NUMBER OF LANGUAGES COVERED: 60 languages from different language groups
PURPOSE: The database allows for queries concerning a given language, a way of realizing a grammatical category, or a set of different parameters, ensuring adequate versatility in data access and data processing.
! A WWW version of the database is planned!
SOFTWARE: MSAccess 2.0.
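
The record structure and the kinds of query described for "Verbum" might look roughly like this. It is only a sketch in Python, not the actual MS Access design; the attribute names, the `query` helper and all sample values are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryRecord:
    language: str               # (1) language name
    category: str               # (2) category name
    manifestation: str          # (3) morpheme or syntactic construction
    manifestation_type: str     # (4) type of manifestation
    regularity: str             # (5) degree of regularity
    cumulated_with: list = field(default_factory=list)     # (6)
    other_meanings: list = field(default_factory=list)     # (7)
    incompatible_with: list = field(default_factory=list)  # (8)
    comments: str = ""          # (9)


def query(records, **criteria):
    """Return the records whose attributes equal all the given criteria."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Placeholder records; none of these values come from the real database.
records = [
    CategoryRecord("LangA", "category_1", "form_1", "affix", "regular"),
    CategoryRecord("LangA", "category_2", "form_2", "construction", "irregular"),
    CategoryRecord("LangB", "category_1", "form_3", "construction", "regular"),
]

by_language = query(records, language="LangA")
by_realization = query(records, category="category_1", manifestation_type="construction")
print(len(by_language), [r.language for r in by_realization])
```

Passing the criteria as keyword arguments lets a single helper cover queries by language, by way of realization, or by any combination of parameters.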

        Claus-Dieter Pusch (Freiburg)
works with a machine-readable corpus of Occitan. This (mini-)database is the starting point of the project called MERCATUS = Maschinenlesbare Einrichtung Romanischer Corpora auf Audio-und Text-Unterstuetzenden Speichermedien (Machine-readable adaptation of Romance spoken language corpora on multi-media data carriers).
KIND OF INFORMATION: spoken texts of Romance languages
NUMBER OF LANGUAGES COVERED: the goal is to document on CD-ROM corpora as many Romance languages as possible (incl. Romance creoles).
SOFTWARE: MS Access 2.0.

---The following databases were represented at the Workshop in Konstanz---

        Morphosyntactic Database (MSD)
under development by Balthasar Bickel and Johanna Nichols,
with help from John B. Lowe on the computational aspects of the project

KIND OF INFORMATION:
        1. Syntactic Constraint Patterns
including information about syntactic patterns like control constructions, anaphora resolution principles, etc., that are sensitive to some sort of grammatical relation or phrase structure
        2. Argument Marking Patterns
including information about head/dependent marking, alignment, case
alternation and related issues
        3. Agreement Patterns
including information about systematic agreement mismatches in number,
person, etc. features
        4. Genetic Affiliation
based on a systematic time-depth scale ranging from 'lowest subbranch'
via 'stock' (e.g., Indo-European) to 'historical pool' (e.g., 'Australian')
        5. Areal Distribution and Location
based on a breakdown of the world into 18 subcontinent-sized areas;
latitude/longitude information for future map creation
NUMBER OF LANGUAGES COVERED:
The total number of languages on which we have collected some of the
above information is around 250, though so far only about a dozen are in
the actual prototype database.
PURPOSE: The MSD has two purposes:
        1. The MSD is a tool for typologically adequate language analysis and
research on cross-linguistically viable concepts. The range of possible
entries in fields is not pre-defined and every new language has the
potential of adding new options for field entries or even revising
previous entries. Therefore, the MSD does not presuppose but rather
produces precisely defined lists of morphosyntactic notions (e.g.,
various grammatical relations, syntactic patterns like control, switch
reference, etc.) that can be recognized across languages.
        2. The MSD is a collection of data for future statistical analyses of
possible typological implications and areal distribution.
SOFTWARE: For the prototype, we use FileMaker Pro 3 for the Macintosh.
SPECIAL FEATURES:
The MSD has a radically modular architecture, i.e., it consists of a
network of related files with information on narrowly defined domains
(e.g., argument marking alignment, or grammatical relations). The
network is accessed through a central file that pulls together data from
each language. New central files, with different emphasis and selection
of data, can be created at any time and files with information on other
domains can be integrated into the network.
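
The modular idea can be illustrated with a small Python sketch. It is not the FileMaker implementation: each "file" is modelled as a plain dictionary keyed by language, and the domain names, values and the `central_view` function are hypothetical.

```python
# Each domain "file" maps a language to its entry for that domain.
# All names and values are placeholders, not MSD data.
domain_files = {
    "alignment": {"LangA": "value_1", "LangB": "value_2"},
    "agreement": {"LangA": "value_3"},
}


def central_view(domain_files, domains):
    """Pull together, per language, the entries of the selected domains.

    A language missing from a domain file gets None for that domain.
    """
    languages = sorted({lang for d in domains for lang in domain_files[d]})
    return {
        lang: {d: domain_files[d].get(lang) for d in domains}
        for lang in languages
    }


view = central_view(domain_files, ["alignment", "agreement"])
print(view["LangB"])

# A new domain can be added later without touching the existing files,
# and a new central view can select a different set of domains.
domain_files["location"] = {"LangB": "value_4"}
areal_view = central_view(domain_files, ["alignment", "location"])
print(areal_view["LangB"])
```

Because the central view is computed from whichever domains are selected, a view with a different emphasis is simply another call with another list.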

        Database SYLLTYP:
        Caroline Féry & Ruben Van de Vijver (Tuebingen)

KIND OF INFORMATION:
The database SYLLTYP is constructed in cooperation with a research group
from Leiden (Dr. Harry van der Hulst).
SYLLTYP collects information about the segments of different languages and
their phonotactics: vowel and consonant-inventories of languages and their positional properties. Furthermore, there is information about morphology and stress.
NUMBER OF LANGUAGES COVERED:
Right now, about 20 languages are included in SYLLTYP. By the end of the
year there will be between 100 and 120 languages. The cooperation with Leiden allows us to speed up the incorporation of languages considerably.
PURPOSE:
SYLLTYP is intended for general linguistic use. It tries to be as theory-independent as possible, so as to make the information contained in it maximally accessible for all kinds of linguists. It should be a help in finding answers to any question related to syllable structure.
SOFTWARE:
The software used is the latest version of 4th Dimension, which is compatible with both Macintosh and IBM.
! The final version of SYLLTYP will be put on the web!

        Norbert Braunschweiler, Jennifer Fitzpatrick-Cole & Aditi Lahiri (Konstanz)

KIND OF INFORMATION:
The database is being set up for a project on intonational analysis. We
analyse entire books, which are read by one professional speaker. One book
is 4-8 hours of speech for each speaker. Each record contains a single
sentence from a whole story or novel. This sentence is described by a
number of fields including sentence-text, transcription of the sentence as it
was spoken, translation into English, annotations, waveform (visual),
f0-contour, wordlabels, acoustic speech signal, phonological phrasing and
intonational tones.
NUMBER OF LANGUAGES COVERED:
6 (British English, American English, East Bengali, West Bengali, Standard
German, Bern German). At present we have one speaker for each language.
Also, we have the German and Bengali translated into English, and the Bern
German translated into Standard German.
PURPOSE: Data storage, information retrieval, presentation.
SOFTWARE: The database comes in two parts. Part of it is stored on the Macintosh in FileMaker 4.0, part of it is stored on the Indy for analysis with Xwaves.

        Universals Database:
        Frans Plank & Elena Filimonova (Konstanz)

KIND OF INFORMATION: Any language universals ever proposed in the linguistic literature - statements like "If a language has feature X, then it has feature Y". Each universal is fully documented: we list author and work, the empirical basis, and known counterexamples. Keywords classify the content of the universal.
PURPOSE: Collection, documentation, and systematization of language universals, as well as various kinds of research into interdependencies between language features.
        We also plan to connect the correlations within a network (if a language has feature X, then it has feature Y, and if it has Y, it has Z, etc.) in order to reconstruct what Sapir called "the great underlying groundplans of language".
        We do not intend to find correlations by our own empirical research; our task is the documentation of already existing universals. Nor do we intend to create an encyclopedia of languages and their characteristics and induce correlations.
SOFTWARE: FileMakerPro 4.0.
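
The network of correlations mentioned under PURPOSE (X implies Y, Y implies Z, etc.) can be pictured as a directed graph. The following Python sketch is only an illustration of such chaining, not the FileMaker implementation; the feature labels and the function name are hypothetical, and it assumes every universal is treated as exceptionless.

```python
from collections import defaultdict, deque

# Each pair (a, c) stands for "if a language has feature a, it has feature c".
# The labels are placeholders, not universals from the database.
universals = [("X", "Y"), ("Y", "Z"), ("W", "Z")]


def implied_features(universals, start):
    """Return every feature reachable from `start` by chaining implications."""
    graph = defaultdict(set)
    for antecedent, consequent in universals:
        graph[antecedent].add(consequent)

    reached = set()
    queue = deque([start])
    while queue:
        feature = queue.popleft()
        for nxt in graph.get(feature, ()):
            if nxt != start and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


print(sorted(implied_features(universals, "X")))
```

Since known counterexamples are recorded for each universal, a chained conclusion is only as reliable as its weakest link; a fuller version would carry the counterexamples along each path.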

! We are also going to publish our database on the internet. The necessary technical work has already been done. We will inform you as soon as the database is online!

Apart from our own database, we want to point out that there is another collection of universals, concentrating on the noun phrase, which is a project in the EUROTYP-Programme led by F. Plank. The collection of 239 (mostly implicational) noun phrase universals was published by Simon Kirby (University of Edinburgh) on the World Wide Web (ling.ed.ac.uk; http://www.ling.ed.ac.uk/eurotyp/) and can be used for research. This database forms the model for our own work.

        Intensifiers and reflexives database.
        Ekkehard Koenig, Peter Siemund & Daniel Hole (Berlin)
KIND OF INFORMATION:
Typological information on intensifiers (self) and reflexives, such as uses of intensifiers, types of intensifiers, parameters of variation, examples: original, gloss, translation. Below are some examples of the datapoints:
- does the language draw a distinction between intensifier and
reflexive or not? English: 'x-self', German 'sich'/'selbst'
- is the intensifier restricted to certain NPs? Turkish: 'kendi' is
only used with human NPs
- does the intensifier have derived uses? French: 'meme' can be
used as a scalar focus particle; Latin: 'ipse' can be used as a
pronoun...
NUMBER OF LANGUAGES COVERED: about 60
PURPOSE: Electronic file manager, statistical purposes, fast data retrieval tool.
SOFTWARE: MS Access

        Database on the Dual
        Frans Plank & Wolfgang Schellinger (Konstanz)

MAIN GOAL: Adequate description of the categorial infrastructure of the dual in the languages of the world; search for cross-linguistic correlations between parameters.
KIND OF INFORMATION: Primarily morphosyntactic data, such as form, paradigm structure, and restrictions with respect to other categories, both for the dual and the plural, for all relevant word classes. Also special usages such as elliptic and sylleptic duals, historical development as far as it can be reconstructed, optionality, and dual conceptions (pair vs. accidental twoness). In most cases, data are analytical and based on printed grammars, the analyses their authors offer, and our own interpretations of the facts presented there.
NUMBER OF LANGUAGES: In principle, the number of languages included in the database is not limited. We try to document as many languages as possible in order to get reasonably close to the full picture. Samples for statistical analyses are drawn from the whole set of languages. The "dual sample" will ultimately contain approximately 240 languages and will be supplemented by a "plural sample" (i.e., languages with plural but lacking dual number) of roughly the same size. The upper limit for dual languages is (according to our ongoing survey of whatever materials we can get hold of) ca. 900 languages.
SOFTWARE: The very first version of the database used FileMaker 1.0 for Macintosh. It has since been converted, without restructuring, to FileMaker 2.0 and, only recently, to FileMaker 4.0. It is currently being overhauled so as to fully exploit the potential of version 4.

----Comparative databases----

        Seth Jerchower (New York)

DB of historic corpora: Judeo-Italian translations from the 14th/15th  and 16th centuries + short grammatical characteristics: "Morphology (N,V, ADJ, etc.) and Syntax (coordination of aspect/tense, genitive constructions, Prep phrases)"
SOFTWARE: MS Access

        Ann Lindvall (Lund)

DB of 6114 transitive clauses in (modern) Greek, Polish and Swedish.
The purpose is to study semantic properties of transitive clauses.
SOFTWARE: FileMakerPro.

----Databases on single languages----
        Johannes Heinecke (Berlin)
is working on a Chechen-German database.
The main PURPOSE is to generate a Chechen-German dictionary including
  a German-Chechen index.
- The database is also used to generate a machine-readable full-form-lexicon
  which is used for a Chechen Part-of-Speech Tagger (for written texts).
  An inflection algorithm expands the lemmata of the original database
  into some 200 000 inflected forms.
- A third application, currently being implemented, is a dependency grammar of Chechen to be used with a parser. The parsing lexicon is to be generated from the database.
KIND OF DATA: The head entries are lemmata. Each entry is complemented by phonologic/phonetic information, inflection class, irregular (and some regular) inflected forms, and German equivalents. Examples and/or idioms are added if necessary.
SOFTWARE: The dictionary software is mainly a programme written in C some years ago. It sorts the head entries according to Chechen (Cyrillic) orthography and produces a LaTeX source file, which is then linked to a LaTeX main file (including introductory sections etc.) to generate the camera-ready version.
The inflected-form lexicon for the POS tagger and parser is generated by
a programme written for the LeX4-Lexicon formalism. This formalism was
developed in conjunction with the Verbmobil-Project.

         Marc Eisinger (Paris)
is working with a database dealing with Nahuatl, an agglutinative language of Mexico. The DB has only two columns: morphs and words. It is used to answer two symmetric questions: (1) given a word, which morphs is it composed of, and (2) given a morph, which words contain that morph.
SOFTWARE: Lotus Approach. But any kind of relational database will do; MS Access can be used as well.
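
Since any relational database will do, the two symmetric questions can be sketched with Python's built-in sqlite3 module. This is only an illustration, not the Lotus Approach setup: the table name, the function names and the entries ("m1", "w1", etc.) are invented placeholders rather than Nahuatl data.

```python
import sqlite3

# A two-column table: one row per (morph, word) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE composition (morph TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO composition VALUES (?, ?)",
    [("m1", "w1"), ("m2", "w1"), ("m1", "w2"), ("m3", "w2")],  # placeholder data
)


def morphs_of(conn, word):
    """Question (1): which morphs is the given word composed of?"""
    cur = conn.execute(
        "SELECT morph FROM composition WHERE word = ? ORDER BY morph", (word,)
    )
    return [morph for (morph,) in cur]


def words_with(conn, morph):
    """Question (2): which words contain the given morph?"""
    cur = conn.execute(
        "SELECT word FROM composition WHERE morph = ? ORDER BY word", (morph,)
    )
    return [word for (word,) in cur]


print(morphs_of(conn, "w1"))
print(words_with(conn, "m1"))
```

Note that a plain two-column table does not record the order of the morphs within a word; a third column for position would be needed for that.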

        Suzanne E Kemmer (Houston)
has been constructing a database on Luo (Nilo-Saharan) using FileMaker Pro, which can be made generally accessible via the World Wide Web (it is already in a web-interfaceable format).
S.Kemmer writes: "I would like to see many such databases accessible to typologists and specialists in particular language families. The link to the web will make doing comparative work, including typological comparison,
much easier if there are many such databases for different languages."