27.939, Qs: Lexicography and variation: big data via Google?

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Mon Feb 22 17:23:14 UTC 2016


LINGUIST List: Vol-27-939. Mon Feb 22 2016. ISSN: 1069-4875.

Subject: 27.939, Qs: Lexicography and variation: big data via Google?

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Anna White <awhite at linguistlist.org>
================================================================


Date: Mon, 22 Feb 2016 12:23:05
From: Stefan Dollinger [stefan.dollinger at sprak.gu.se]
Subject: Lexicography and variation: big data via Google?

 
Dear colleagues,

While the use of internet data in lexicography is nothing new, the question
has been raised of how best to normalize the "big and messy" data on the
internet using site-restricted searches (SRSs). SRSs have been employed to
obtain information on the regional variation of a given term (and, ideally, a
given meaning), yet some issues remain unresolved. The question of how to
phrase searches so that they target specific meanings is perhaps the most
challenging aspect, though by no means the only one.
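To make the normalization idea concrete, here is a minimal sketch in Python. All figures are invented placeholders: real hit counts would come from queries such as "term site:.ca" in a search engine, and the per-domain totals from a very common anchor word searched under the same restriction, used as a rough proxy for index size.

```python
# Back-of-envelope normalization of site-restricted search (SRS) hit counts.
# All numbers below are hypothetical illustrations, not real data.

# Hypothetical raw hits for a term under each top-level-domain restriction.
term_hits = {".ca": 41000, ".uk": 2300, ".au": 1900}

# Hypothetical hits for a high-frequency anchor word under the same
# restrictions, serving as a proxy for the indexed size of each domain.
anchor_hits = {".ca": 52_000_000, ".uk": 310_000_000, ".au": 48_000_000}

# Normalized incidence: term hits per million anchor hits, so that
# domains of very different sizes become comparable.
for tld in term_hits:
    per_million = term_hits[tld] / anchor_hits[tld] * 1_000_000
    print(f"{tld}: {per_million:.1f} hits per million")
```

The point of the anchor-word step is that raw hit counts conflate a term's regional incidence with the sheer size of each domain's index; dividing the two out is one simple way to separate them.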

An interesting discussion is developing in this forum:
https://www.academia.edu/s/1a487c74ab?source=link

I wonder whether anyone has used black-box commercial search engines such as
Google, which, despite its shortcomings and annoyances, offers a temptingly
large index, in fact the largest in the world. Other search engines, e.g.
exalead.com, are more precise, yet their indexes are smaller.

My question: does anyone have experience with this, or can anyone add to the
methodology presented in the discussion forum above?

As the issues raised relate to a number of linguistic approaches, I would ask
primarily for input on open-class lexical items, which, in contrast to most
grammatical items, show very low frequency counts. It is important that
participants consider this aspect, which means, as shown in the discussion
paper, that existing web-scale resources (of 12 billion words and the like)
are still much too small to assist in the regional labelling of lexical items.
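A quick back-of-envelope calculation illustrates the point. The frequency figure below is an assumption for the sake of the arithmetic, not a value taken from the discussion paper:

```python
# Why 12 billion words can still be too small for rare regional lexis.
# Assume a regionally marked open-class item occurring at 0.01 tokens
# per million words (an assumed rate for illustration only).
corpus_size = 12_000_000_000      # 12 billion words
freq_per_million = 0.01

expected_tokens = corpus_size / 1_000_000 * freq_per_million
print(expected_tokens)  # 120.0 tokens in the whole corpus

# Split across, say, 20 regional subdivisions, that leaves on average
# only a handful of tokens per region.
per_region = expected_tokens / 20
print(per_region)  # 6.0
```

At six expected tokens per region, sampling noise alone swamps any regional signal, which is why much larger indexes become attractive despite their messiness.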

Thanks for your input. You are welcome to post directly in the forum on
academia.edu, on the entire approach or on any aspect of the paper (click on
the relevant text passage to open a dialog box for your comment).

I will post a summary on linguistlist.org.

Thank you for considering offering your expertise.
Stefan
 

Linguistic Field(s): Anthropological Linguistics
                     Applied Linguistics
                     Computational Linguistics
                     Historical Linguistics
                     Lexicography
                     Ling & Literature
                     Semantics
                     Sociolinguistics
                     Text/Corpus Linguistics



------------------------------------------------------------------------------
