[Ads-l] programmatic generation of word lists for specialized dictionaries?

Tue Jan 12 20:45:23 UTC 2016

hi tim,

i like your approach and would like to discuss your results and ideas.
our lexicography approach mainly starts from already existing dictionaries and supports their transformation against the background of social innovation and research infrastructures;
but thanks to our strong corpus-based and language-technology expertise (NLP, semantic technologies, corpus linguistics) we are experimenting towards results similar to yours,
e.g. word lists and dictionary candidates concerning language variation from social media, interlinked with linked (open) data of different provenance.

i strongly invite you to discuss your topic in the framework of the LREC congress, at the first globalex workshop: http://ailab.ijs.si/globalex/dates-and-submission/

i would be pleased to get in contact and discuss your work and possibilities of collaboration.

warm regards from vienna:
eveline wandl-vogt

***

coordinator research group "lexicography laboratory" @ austrian centre for digital humanities @ österreichische akademie der wissenschaften (ÖAW) [austrian academy of sciences] | 1040 wien. AT | wohllebengasse 12-14/2 |
http://www.oeaw.ac.at/acdh/ | http://wboe.oeaw.ac.at

________________________________________
From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf of Tim Stewart [timoteostewart1977 at GMAIL.COM]
Sent: Tuesday, 12 January 2016 17:49
To: ADS-L at LISTSERV.UGA.EDU
Subject: programmatic generation of word lists for specialized dictionaries?

---------------------- Information from the mail header -----------------------
Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
Poster:       Tim Stewart <timoteostewart1977 at GMAIL.COM>
Subject:      programmatic generation of word lists for specialized
              dictionaries?
-------------------------------------------------------------------------------

I've written a specialized dictionary, and now I'm attempting to write an
article about how I used a computer program to help generate the word list
for it. I'm curious how often other lexicographers have employed a
programmatic approach to generating a list of hypothetical forms and then
testing those forms against corpora to determine which forms represent
lexical items in use (I believe it has been done at least once
before---details below). So far my efforts to dig up information about this
topic in JSTOR and other academic databases have been fruitless. Maybe the
ADS list can help!

My dictionary contains 350 lexical items, each of which is a blend of two
(or more) names of Christian denominations. Examples of these items are
*bapticostal* (*Bapti*st + Pente*costal*), *fundagelical* (*funda*mentalist
+ evan*gelical*), and *lutholic* (*Luth*eran + Cath*olic*). All the items
are formed by blending syllables from a small set of 23 names of
denominations (Anglican, Baptist, Catholic, Episcopal, etc.). Given the
very narrow morphological and phonological criteria involved, it occurred
to me to generate a list of possible items by programmatically combining
parts of the names of these 23 denominations. Then I conducted searches for
these hypothetical forms against corpora and online text databases to
determine which forms I could find evidence for. I don't have the exact
results in front of me, but my computer program generated several thousand
hypothetical forms, and my searches then turned up quotational evidence for
around 100 terms. So the success rate was somewhere in the neighborhood of
2%.
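
The generate-then-filter procedure described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the program actually used: it splits names at every character position rather than at syllable boundaries, the function names and length thresholds are hypothetical, and the "corpus" is a stand-in set of tokens rather than a real corpus search.

```python
from itertools import permutations


def candidate_blends(names, min_head=3, min_tail=3):
    """Return hypothetical blends: a head of one name plus a tail of another.

    Assumption: splits are made at every character position (a
    simplification of the syllable-based blending described in the post).
    """
    forms = set()
    for first, second in permutations(names, 2):
        a, b = first.lower(), second.lower()
        # Proper prefixes of the first name, at least min_head letters long.
        for i in range(min_head, len(a)):
            # Proper suffixes of the second name, at least min_tail letters long.
            for j in range(1, len(b) - min_tail + 1):
                forms.add(a[:i] + b[j:])
    return forms


def filter_attested(candidates, corpus_tokens):
    """Keep only the candidates that occur in the (stand-in) corpus token set."""
    return sorted(c for c in candidates if c in corpus_tokens)


if __name__ == "__main__":
    # A handful of source names for illustration; the post mentions 23.
    names = ["Baptist", "Pentecostal", "Lutheran", "Catholic",
             "Fundamentalist", "Evangelical"]
    forms = candidate_blends(names)
    # Hypothetical stand-in for corpus evidence.
    corpus_tokens = {"bapticostal", "fundagelical", "lutholic"}
    print(len(forms), "hypothetical forms generated")
    print(filter_attested(forms, corpus_tokens))
```

In practice the second step would be a search against corpora or text databases, with each hit checked by hand, rather than a simple set lookup.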

So, on to my question... have there been other dictionaries whose word list
was (partly) generated using a method of programmatically generating
hypothetical forms and then winnowing the word list?

My understanding is that it has happened at least once before. In
*A Krio-English Dictionary* (OUP, 1980), Fyle and Jones describe a method
they used back in the early 1970s to rapidly build up their Krio word list:

“A search for all known monosyllables in the language, using native-speaker
competence. The method was simply to note all the combinations of
consonant(s) + vowel + consonant(s) ((C^n)V(C^n)) allowable by the
phonology of the language, and to record all those that turned out to be
actual Krio monosyllables. This search yielded well over 1,000
monosyllables” (xii).
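
The enumeration step in that passage, listing every (C^n)V(C^n) combination before checking each one against native-speaker knowledge, could be sketched like this. The onset, vowel, and coda inventories below are hypothetical placeholders, not an account of Krio phonology, and the function name is made up for the example.

```python
from itertools import product

# Hypothetical inventories for illustration only. The empty string
# stands for an absent (optional) onset or coda.
ONSETS = ["", "b", "k", "t", "br", "kr"]
VOWELS = ["a", "e", "i", "o", "u"]
CODAS = ["", "n", "t", "s"]


def candidate_monosyllables(onsets, vowels, codas):
    """Return every onset + vowel + coda string the inventories allow."""
    return [o + v + c for o, v, c in product(onsets, vowels, codas)]


if __name__ == "__main__":
    candidates = candidate_monosyllables(ONSETS, VOWELS, CODAS)
    print(len(candidates), "candidate monosyllables to check")
```

Each candidate would then be accepted or rejected by a speaker of the language, which is the filtering step Fyle and Jones describe.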

Any help is appreciated.

Tim Stewart
tim at dictionaryofchristianese.com

P.S. For those who may be curious about this project and want to know more
about it, see http://www.dictionaryofblendeddenominations.com for a brief
description.

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org
