[Lingtyp] Concerns about U.S. policies and linguistic archives

Emily M. Bender ebender at uw.edu
Tue Feb 4 15:53:25 UTC 2025


Please do not appeal to LLMs as any source of security here.

1) If that info is in Common Crawl, then yes, it might be retrievable. But
that is only a snapshot and I don't know how well organized that dataset
is.
2) If you're talking about using LLM output to "recreate" the data, you're
proposing adding enormous amounts of noise. This is the opposite of good
data handling practices.
3) Many archives rightly practice various kinds of access restrictions. If
these restrictions are implemented well, the data meant to be accessible
only to community members and/or community-approved researchers should not
have been scraped.

It is my understanding as well that only federal data sources are
immediately in danger, but not all US institutions are heeding the advice
"do not obey in advance".

Emily


On Tue, Feb 4, 2025 at 7:48 AM Stela Manova via Lingtyp <
lingtyp at listserv.linguistlist.org> wrote:

> It seems to me that the problem should not be approached with a
> traditional logic. For example, everything freely available on the web can
> be seen as already archived, more or less, because it has been used as
> training data for Large Language Models.
> Best,
> Stela
>
> ------------------------------
> *From:* Lingtyp on behalf of Juergen Bohnemeyer via Lingtyp
> *Sent:* Tuesday, February 4, 2025 3:33 PM
> *To:* Jocelyn Aznar; lingtyp at listserv.linguistlist.org
> *Subject:* Re: [Lingtyp] Concerns about U.S. policies and linguistic
> archives
>
> Dear Jocelyn – Indeed, we are once again finding ourselves in
> “interesting”, read unprecedented and disturbing, times. Now, I may not be
> in the best position to respond to your query, but any immediate concern
> for the safety of language archives would only relate to things that are
> under the control of the federal government, such as the Library of
> Congress or the National Endowment for the Humanities. And as far as I
> know, these have not been archiving data and records from endangered
> languages.
>
>
>
> I do, however, worry about the Smithsonian Institution in this regard.
> Other than the Smithsonian, the language archive that comes immediately to
> mind, AILLA at UT, is not under the purview of the federal government.
>
>
>
> In any event, beyond the current situation, it seems indeed vitally
> important to connect the world’s digital language archives and create a
> system of mirrors in order to effectively decentralize the data and thereby
> make it less vulnerable to threats on any one site or even country. It’s my
> understanding that the people in charge of the archives are well aware of
> this and have begun to take steps. But it’s a long-haul project, based on
> my very incomplete understanding.
>
>
>
> Best – Juergen
>
>
>
> Juergen Bohnemeyer (He/Him)
> Professor, Department of Linguistics
> University at Buffalo
>
> Office: 642 Baldy Hall, UB North Campus
> Mailing address: 609 Baldy Hall, Buffalo, NY 14260
> Phone: (716) 645 0127
> Fax: (716) 645 3825
> Email: jb77 at buffalo.edu
> Web: http://www.acsu.buffalo.edu/~jb77/
>
>
> Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom (Meeting ID 585
> 520 2411; Passcode Hoorheh)
>
> There’s A Crack In Everything - That’s How The Light Gets In
> (Leonard Cohen)
>
> --
>
>
>
>
>
> *From: *Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of
> Jocelyn Aznar via Lingtyp <lingtyp at listserv.linguistlist.org>
> *Date: *Tuesday, February 4, 2025 at 05:02
> *To: *lingtyp at listserv.linguistlist.org <lingtyp at listserv.linguistlist.org>
> *Subject: *[Lingtyp] Concerns about U.S. policies and linguistic archives
>
> Dear colleagues,
>
> I know this list is primarily meant for discussing ideas and
> observations related to linguistic typology, rather than politics.
> However, current U.S. policies regarding scientific data have led me to
> wonder whether these policies might affect the fields of linguistics and
> humanities.
>
> When I heard about data related to ecology and the environment being
> discarded, I immediately worried the same could happen to linguistic
> archives and datasets. But maybe it is just me. Dear colleagues working
> in the US, what do you think? Could this happen as well to archives
> related to linguistics and humanities?
>
> I believe that if we address this issue proactively, we’ll be better
> placed to preserve more data should the need arise. For instance, we
> could check whether the existing infrastructure outside of the US (ELAR,
> HumaNum/Ortolang, Pangloss, Paradisec, etc.) would be able to handle or
> help to face such a crisis, or whether we should consider setting up
> some sort of emergency server so that researchers can transfer data at
> risk of being lost.
>
> One possible strategy would be to prepare a brief manual (probably as a
> webpage), after discussing with each institution of course, describing
> which archives outside the U.S. could accept data from an archive from
> the US, in which format, what kind of data would be accepted, etc. Then,
> if needed, U.S.-based researchers could formulate a plan to safeguard
> their data. By doing that, we could also identify gaps in current
> coverage and, if necessary, establish an emergency archive or server to
> fill those gaps.
>
> Best regards,
> Jocelyn Aznar
>
> ¹ I’m of course also concerned about data from other fields, though I
> feel more competent discussing linguistic data. Still, if we build an
> infrastructure for linguistic data from the U.S., it might be possible
> to scale it up for other disciplines as well.
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp