[Lingtyp] Deriving Tns-Asp-Mood orders and Greenberg’s Universal 20 with n-grams
Stela Manova
stela.manova at univie.ac.at
Wed Jul 14 13:58:08 UTC 2021
Dear LINGTYP readers,
Some time ago, after an email exchange with Mattis List on this list about
how Google services process natural language, I was asked to comment on
recent NLP research, especially on NLP without grammar. I declined for a
number of reasons, personal and professional alike. However, for the past
year and a half I have had the chance to experiment with n-grams, and I am
now ready to demonstrate how NLP without grammar works, based on sequences
of elements considered important in linguistics, such as the attested and
unattested TAM orders and Greenberg's Universal 20 and its exceptions.
Since this research is directly relevant to linguistic typology, I would
be grateful to receive input from typologists. The abstract of the paper
follows:
The linear order of elements in prominent linguistic sequences:
Deriving Tns-Asp-Mood orders and Greenberg’s Universal 20 with n-grams
Stela MANOVA
stela.manova at univie.ac.at
Current NLP research uses neither linguistically annotated corpora nor
the traditional pipeline of linguistic modules, which raises questions
about the future of linguistics. Linguists who have tried to crack the
secrets of deep-learning NLP models, including BERT (a bidirectional
transformer-based ML technique employed for Google Search), have had as
their ultimate goal to show that deep nets make linguistic
generalizations. I opted for an alternative approach. To check whether
it is possible to process natural language without grammar, I developed
a very simple model, the End-to-end N-Gram Model (EteNGraM), that
elaborates on the standard n-gram model. EteNGraM, at a very basic level,
imitates current NLP research by handling semantic relations without
semantics. As in NLP, I pre-trained the model with the orders of the
TAM markers in the verbal domain, fine-tuned it, and then applied it to
derive Greenberg's Universal 20 and its exceptions in the nominal
domain. Although EteNGraM is ridiculously simple and uses only bigrams
and trigrams, it successfully derives the attested and unattested
patterns in Cinque (2005) "Deriving Greenberg's Universal 20 and Its
Exceptions", Linguistic Inquiry 36, and Cinque (2014) "Again on Tense,
Aspect, Mood Morpheme Order and the 'Mirror Principle'", in Functional
Structure from Top to Toe: The Cartography of Syntactic Structures 9.
EteNGraM also makes fine-grained predictions about preferred and
dispreferred patterns across languages and reveals novel aspects of the
organization of the verbal and nominal domains. To explain EteNGraM's
highly efficient performance, I address issues such as: complexity of
data versus complexity of analysis; structure building by linear
sequences of elements versus by hierarchical syntactic trees; and how
linguists can contribute to NLP research.
The full text is available at: https://ling.auf.net/lingbuzz/006082
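To give readers a concrete idea of how a bigram/trigram derivation of
ordering patterns can work at all, here is a minimal Python sketch in the
spirit of EteNGraM. The training orders, the licensing criterion, and
every implementation detail below are my own illustrative assumptions,
not the model as implemented in the paper:

from itertools import permutations

# Toy "training" data: two attested orders of TAM markers relative to
# the verb stem (V), a suffixing and a prefixing one. These orders are
# invented for this sketch; see the paper for the real training data.
ATTESTED_ORDERS = [
    ("V", "Asp", "Tns", "Mood"),
    ("Mood", "Tns", "Asp", "V"),
]

def ngrams(seq, n):
    """Return the set of contiguous n-grams of a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

# "Pre-training": collect the bigrams and trigrams of the attested orders.
known = set()
for order in ATTESTED_ORDERS:
    known |= ngrams(order, 2) | ngrams(order, 3)

def derivable(order):
    """A candidate order counts as derivable (hypothetical criterion)
    if all of its bigrams and trigrams occur in the training data."""
    return (ngrams(order, 2) | ngrams(order, 3)) <= known

# Score all 24 permutations of the four elements.
for cand in permutations(["V", "Asp", "Tns", "Mood"]):
    print("-".join(cand), "derivable" if derivable(cand) else "not derivable")

Under these toy assumptions, only orders whose bigrams and trigrams are
all licensed by the training set come out as derivable, with no appeal to
semantics or syntactic structure; the paper's actual pre-training and
fine-tuning are, of course, richer than this.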
Many people believe that numbers serve primarily for counting -- money,
and things that can be bought with money. Yet, against all logic, the
noun money is uncountable in, e.g., modern English; and, as usual in
linguistics, the problem becomes even more complicated once one looks at
languages other than English: in my mother tongue, Bulgarian, money is a
plurale tantum. I hope that my math-oriented research reveals the
fascinating world of numbers in a more convincing way. :)
Best wishes,
Stela