26.2325, FYI: The Gavagai Living Lexicon

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Mon May 4 18:35:26 UTC 2015


LINGUIST List: Vol-26-2325. Mon May 04 2015. ISSN: 1069 - 4875.

Subject: 26.2325, FYI: The Gavagai Living Lexicon

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================


Date: Mon, 04 May 2015 14:33:46
From: Magnus Sahlgren [mange at gavagai.se]
Subject: The Gavagai Living Lexicon

We are proud to announce the release of the Gavagai Living Lexicon - an online lexicon that gives you access to the knowledge our distributional semantic models gather about terms in language as it is used by people in every corner of the known world.

The lexicon is available at: http://lexicon.gavagai.se

The lexicon is based on Gavagai's distributional semantic models, which learn continuously from live data feeds of millions of documents per day from both social and news media. This means that the living lexicon is continuously evolving and always up to date with current language use. As an example, try searching for a topical term such as "earthquake" (http://lexicon.gavagai.se/lookup/en/earthquake) to see what the lexicon has learned during the last couple of days.
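The example links in this announcement all follow the pattern /lookup/<language>/<term>. The short Python sketch below builds such links. The helper name lookup_url is hypothetical, and both the percent-encoding of terms and the use of two-letter codes for languages other than "en" are assumptions; only the "en" links shown in this message are confirmed.

```python
from urllib.parse import quote

# Base of the lookup links that appear in this announcement.
BASE_URL = "http://lexicon.gavagai.se/lookup"


def lookup_url(term: str, lang: str = "en") -> str:
    """Build a lexicon lookup link of the form /lookup/<lang>/<term>.

    Assumption: terms containing spaces or non-ASCII characters are
    percent-encoded. The announcement only shows plain ASCII terms.
    """
    return f"{BASE_URL}/{quote(lang, safe='')}/{quote(term, safe='')}"


if __name__ == "__main__":
    for term in ("earthquake", "apple", "suit"):
        print(lookup_url(term))
```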

The lexicon currently provides the following information:

- the frequency rank of the term in the lexicon
- similarly spelled terms
- common left and right neighbors (i.e. left and right collocations)
- multi-word units (n-grams) that include the search term
- semantically similar terms (i.e. terms that have been used in a similar way in online data)
- associatively related terms (i.e. terms that have often been used in the same documents as the search term)
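The difference between the last two items can be illustrated with a toy Python sketch. This is not Gavagai's algorithm (the publications describing it are still in preparation); it is a minimal illustration under assumed definitions: "semantically similar" is approximated by cosine similarity between neighbor-count vectors, and "associatively related" by counting shared documents. The corpus, window size, and function names are all invented for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Tiny invented corpus; each string stands for one document.
DOCS = [
    "she ate a red apple",
    "she ate a red pear",
    "the apple pie recipe needs cinnamon",
    "the pie recipe needs butter",
]


def neighbor_vectors(docs, window=1):
    """Map each term to a Counter of the terms seen within `window` positions."""
    vectors = {}
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            ctx = vectors.setdefault(tok, Counter())
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def doc_cooccurrence(docs):
    """Count, for each alphabetically ordered term pair, the documents containing both."""
    counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc.split())), 2):
            counts[(a, b)] += 1
    return counts


vectors = neighbor_vectors(DOCS)
cooc = doc_cooccurrence(DOCS)

# "apple" and "pear" share a neighbor ("red") but never share a document.
print("similarity apple/pear:", cosine(vectors["apple"], vectors["pear"]))
print("shared documents apple/pear:", cooc[("apple", "pear")])

# "apple" and "cinnamon" share a document but have no neighbors in common.
print("similarity apple/cinnamon:", cosine(vectors["apple"], vectors["cinnamon"]))
print("shared documents apple/cinnamon:", cooc[("apple", "cinnamon")])
```

In this toy corpus "pear" comes out as similar to "apple" without ever co-occurring with it, while "cinnamon" co-occurs with "apple" without being used in similar contexts, which mirrors the distinction drawn in the list above.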

Both the semantically similar and the associatively related terms are automatically grouped into clusters of similar and related terms, respectively. The semantic groups are also labelled with their most common collocations. You can think of the labels as an explanation of why the terms are clustered together. As an example, try searching for "apple" (http://lexicon.gavagai.se/lookup/en/apple). You can see that the distributional semantic model has learned a number of different usages of "apple", including apple as an ingredient, apple as a product, apple as a stock, and apple as a fruit. Another example is a search for "suit" (http://lexicon.gavagai.se/lookup/en/suit), which demonstrates that the lexicon has learned both the garment sense and the legal sense.

The lexicon is currently available in Arabic, Danish, English, Estonian, Finnish, German, French, Latvian, Lithuanian, Norwegian, Portuguese, Russian, Spanish, and Swedish. More languages will be added continuously. The size of the vocabulary for each language depends on the amount of online data we listen to for that particular language. English is currently the largest language in the lexicon, with a vocabulary of more than 2,500,000 unique terms. The 200,000 most common of these terms have entries in the English lexicon.

If you are a developer and want to access the lexicon functionality directly through our API, simply sign up for a free developer account at: developer.gavagai.se

Note that our developer APIs also offer multi-document summarization, tonality analysis, and keyword extraction.

We appreciate any feedback on the lexicon and our APIs. 

Contact us at:
info at gavagai.se

(Publications describing the algorithms behind the living lexicon are in preparation and will be added to the lexicon site once published.)
 
Linguistic Field(s): Computational Linguistics
                     Lexicography
                     Sociolinguistics
                     Text/Corpus Linguistics


