27.2136, Sum: Lexicography and variation: big data via Google?

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Tue May 10 14:18:10 UTC 2016


LINGUIST List: Vol-27-2136. Tue May 10 2016. ISSN: 1069 - 4875.

Subject: 27.2136, Sum: Lexicography and variation: big data via Google?

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Anna White <awhite at linguistlist.org>
================================================================


Date: Tue, 10 May 2016 10:17:35
From: Stefan Dollinger [stefan.dollinger at sprak.gu.se]
Subject: Lexicography and variation: big data via Google?

 
Discussion period: 19 Feb. to c. mid March 2016.

The discussion was centred on this draft paper:
https://www.academia.edu/s/1a487c74ab

It was announced at https://linguistlist.org/issues/27/27-939.html

Summary:

Fifty-two discussants took part in the 21-day Session on academia.edu. Below I
attempt to summarize what were, in my view, the most salient issues. For other
topics, please refer directly to the Session link (read from the bottom up).
Thanks to all those who gave their time. Apologies to those whose posting(s)
are not reported below; the omission does not reflect a lack of appreciation.

One conversation stream, started by Robert Lew, cut right to the validity of
the entire approach. Robert, in measured yet incisive messages, raised a
number of concerns about using Google, one of them rather serious. He insisted
that using Google was ''bad science'', echoing an Adam Kilgarriff paper from
2007, which I had used as a springboard to reconsider some of Adam's
objections in ''Googleology as Smart Lexicography''. I realized early on that
the discussion centred a bit too much on Google, though this was an
inadvertent reflection of the title. The discussion was intended to address
the more general question of ''how reliable are web searches'' with
open-access search engines, but the Google focus made it very concrete and
tangible and offered insights that go far beyond Google's ''Black Box''
structure.

Most crucially, Robert pointed out that Google page counts are unreliable: the
figures displayed do not match the number of results one finds when clicking
through. This is a serious problem for any method, such as this one, that
relies on Google's numbers to create its normalized cross-domain indices. The
error nevertheless seems to be inflated proportionally across domains, as the
results found in DCHP-2 match what we know about the regional patterns of
Canadianisms, both internationally and on a Canada-internal regional scale
(for a very concise account, see
https://www.academia.edu/18967380/How_to_write_a_historical_dictionary_a_sketch_of_The_Dictionary_of_Canadianisms_on_Historical_Principles_Second_Edition).
So, while the absolute numbers are off, the ratios between the numbers in
different domains seem to be correct. Robert pursued the issue further and
found a number of infelicities even in the ratios. There clearly is more work
to be done, but the use of more precise search engines is not the panacea it
seems. The question remains, and will be verifiable by everyone once DCHP-2 is
in open access in late 2016, why the results we get are in line with the few
terms whose regional variation we knew and, for the many others, usually make
perfect sense when matched with the extra-linguistic histories of the terms.
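
The argument about proportional error can be made concrete with a small
sketch. It is an illustration only, not the procedure used for DCHP-2: it
assumes that a normalized index is simply the hit count for a term in a
domain divided by that domain's size, and all function names, domain labels
and figures are hypothetical. It shows that a uniform inflation of the hit
counts leaves the cross-domain ratios intact, whereas a domain-specific
error distorts them.

```python
from fractions import Fraction


def normalized_index(term_hits, domain_size, per=1_000_000):
    """Hits for a term in one domain, per `per` pages of that domain."""
    if domain_size <= 0:
        raise ValueError("domain size must be positive")
    return Fraction(term_hits * per, domain_size)


def indices(counts):
    """Map each domain to its normalized index."""
    return {d: normalized_index(h, s) for d, (h, s) in counts.items()}


def ratios(idx, baseline):
    """Express every domain's index relative to the baseline domain."""
    base = idx[baseline]
    return {d: v / base for d, v in idx.items()}


def scale_hits(counts, factors):
    """Multiply each domain's hit count by its own error factor."""
    return {d: (h * factors[d], s) for d, (h, s) in counts.items()}


# Hypothetical figures: domain -> (hits for the term, size of the domain).
true_counts = {
    ".ca": (5_000, 2_000_000),
    ".uk": (1_000, 4_000_000),
    ".us": (3_000, 12_000_000),
}

# Assumed error models: the same inflation everywhere vs. one skewed domain.
uniform = {".ca": 3, ".uk": 3, ".us": 3}
skewed = {".ca": 3, ".uk": 6, ".us": 3}

true_ratios = ratios(indices(true_counts), ".us")
uniform_ratios = ratios(indices(scale_hits(true_counts, uniform)), ".us")
skewed_ratios = ratios(indices(scale_hits(true_counts, skewed)), ".us")

print("uniform error preserves ratios:", uniform_ratios == true_ratios)
print("skewed error preserves ratios:", skewed_ratios == true_ratios)
```

Exact fractions are used so that the comparison of ratios is not affected by
floating-point rounding.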

The entire point of my paper was that the clean web-scaled resources that
Kilgarriff advocates are still not big enough (e.g. 12 billion words) to
yield regional information. So if we would like to have regional labels in
dictionaries, one of the areas in which lexicographers, as I argue, do worst,
we will have to make the best of suboptimal search engines. I make this point
in the paper with an example. Robert pointed out that the Yandex and Exalead
search engines might be preferable, yet it remains to be checked whether their
indices are large enough to compete with the data from the messy Google
interface.

The point raised in the paper, that Google MUST be tracked and that results
can only be confirmed post-hoc with the help of extensive tracking data, is
the key to the method, which, I believe, has been refined. This would apply to
other indices as well, whether Yandex, Exalead or others. So, one take-away
message might be: If you want to argue from frequency, NEVER just search the
web; always track the domain sizes, document them and then search the web.
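
The take-away above (track the domain sizes, document them, then search) can
be sketched as a small log that refuses to normalize a hit count unless a
domain size was recorded on or before the day of the search. This is a
minimal sketch under stated assumptions, not the tracking setup used for the
paper: the class `DomainSizeLog`, the function `normalized_hits`, the dates
and all figures are hypothetical.

```python
from bisect import bisect_right, insort
from datetime import date


class DomainSizeLog:
    """Dated record of observed domain sizes (hypothetical helper)."""

    def __init__(self):
        self._obs = {}  # domain -> sorted list of (day, size)

    def record(self, domain, day, size):
        """Document one observation of a domain's size."""
        if size <= 0:
            raise ValueError("domain size must be positive")
        insort(self._obs.setdefault(domain, []), (day, size))

    def size_on(self, domain, day):
        """Return the most recent size recorded on or before `day`."""
        entries = self._obs.get(domain, [])
        i = bisect_right(entries, (day, float("inf")))
        if i == 0:
            raise LookupError(f"no tracked size for {domain} by {day}")
        return entries[i - 1][1]


def normalized_hits(log, domain, day, hits, per=1_000_000):
    """Hits per `per` pages, using the size tracked for that day."""
    return hits * per / log.size_on(domain, day)


log = DomainSizeLog()
log.record(".ca", date(2016, 2, 1), 2_000_000)  # invented figures
log.record(".ca", date(2016, 3, 1), 4_000_000)

# The same raw hit count yields different indices on different days,
# which is why the sizes need to be tracked alongside the searches.
print(normalized_hits(log, ".ca", date(2016, 2, 15), 1_000))
print(normalized_hits(log, ".ca", date(2016, 3, 1), 1_000))
```

Raising `LookupError` for an untracked domain or date is a deliberate design
choice here: a frequency claim without a documented domain size is rejected
rather than silently normalized against a guess.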

Is Google messy? No doubt. Do we have an alternative for the kind of tasks
geographically-minded lexicographers need to handle? Not yet. 

So while web-scaled corpora (e.g. TenTen and resources in SketchEngine) and
resources like GloWbE (Mark Davies) are extremely useful, they are far too
small (in the latter case very much so) to help establish the regional
distributions of the lexical items searched for.

There is room for more exploration. Lexicographers are generally not
computational linguists, so any method intended to help them would need to be
simple and effective – or come in the form of an app. That was the idea behind
my paper: a method that is practical and computationally unsophisticated, yet
sound (or at least much sounder than current practice).

Thanks to all who posted comments: besides Robert Lew, especially Robert
Fuchs, Dorota Lockyer and Victoria Ventura for their various suggestions. I
will incorporate them, with full acknowledgement, as far as possible in the
final version of the paper.

Thanks for the contributions! Great to be part of the collaborative spirit.

Stefan Dollinger
https://gu-se.academia.edu/StefanDollinger 
Gothenburg, Sweden, 6 May 2016
 

Linguistic Field(s): Anthropological Linguistics
                     Applied Linguistics
                     Computational Linguistics
                     General Linguistics
                     Historical Linguistics
                     Lexicography
                     Ling & Literature
                     Semantics
                     Sociolinguistics


