[Lingtyp] Concerns about U.S. policies and linguistic archives
Heaton, Raina
rainaheaton at ou.edu
Tue Feb 4 16:06:58 UTC 2025
Hello everyone,
First, thank you for your concern. I can't speak to what might be going on with US federal repositories, but at least for the archive I manage (a regional language archive, in a red state no less), we are not worried at present about data being taken or lost; we have many, many backups and legal agreements with our depositors and tribes that would make it very hard for the data to go anywhere. What we are watching closely are the attacks on the federal funding that supports language-related activities, from data collection to archiving to archival infrastructure to digitization to our student workers. If those attacks succeed this time, they may make it significantly harder for everyone to operate here.
All the best,
Raina
Dr. Raina Heaton
Presidential Associate Professor, Native American Studies
Associate Curator, Native American Languages
Director, Oklahoma Native American Youth Language Fair
Sam Noble Museum
University of Oklahoma
Celebrating 125 years of exploration and education
2401 Chautauqua Ave. Norman, OK 73072-7029
________________________________
From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Emily M. Bender via Lingtyp <lingtyp at listserv.linguistlist.org>
Sent: Tuesday, February 4, 2025 9:53 AM
To: Stela Manova <manova.stela at gmail.com>
Cc: lingtyp at listserv.linguistlist.org <lingtyp at listserv.linguistlist.org>
Subject: Re: [Lingtyp] Concerns about U.S. policies and linguistic archives
Please do not appeal to LLMs as any source of security here.
1) If that info is in Common Crawl, then yes, it might be retrievable. But that is only a snapshot and I don't know how well organized that dataset is.
2) If you're talking about using LLM output to "recreate" the data, you're proposing adding enormous amounts of noise. This is the opposite of good data handling practices.
3) Many archives rightly practice various kinds of access restrictions. If these restrictions are implemented well, the data meant to be accessible to community members and/or community-approved researchers should not have been scraped.
It is my understanding as well that only federal data sources are immediately in danger, but not all US institutions are heeding the advice "do not obey in advance".
Emily
On Tue, Feb 4, 2025 at 7:48 AM Stela Manova via Lingtyp <lingtyp at listserv.linguistlist.org> wrote:
It seems to me that the problem should not be approached with traditional logic. For example, everything freely available on the web can be seen as already archived, more or less, because it has been used as training data for Large Language Models.
Best,
Stela
________________________________
From: Lingtyp on behalf of Juergen Bohnemeyer via Lingtyp
Sent: Tuesday, February 4, 2025 3:33 PM
To: Jocelyn Aznar; lingtyp at listserv.linguistlist.org
Subject: Re: [Lingtyp] Concerns about U.S. policies and linguistic archives
Dear Jocelyn – Indeed, we are once again finding ourselves in “interesting”, read unprecedented and disturbing, times. Now, I may not be in the best position to respond to your query, but any immediate concern for the safety of language archives would only relate to things that are under the control of the federal government, such as the Library of Congress or the National Endowment for the Humanities. And as far as I know, these have not been archiving data and records from endangered languages.
I do, however, worry about the Smithsonian Institution in this regard. Other than the Smithsonian, the language archive that comes immediately to mind, AILLA at UT, is not under the purview of the federal government.
In any event, beyond the current situation, it seems indeed vitally important to connect the world’s digital language archives and create a system of mirrors in order to effectively decentralize the data and thereby make it less vulnerable to threats on any one site or even country. It’s my understanding that the people in charge of the archives are well aware of this and have begun to take steps. But it’s a long-haul project, based on my very incomplete understanding.
Best – Juergen
Juergen Bohnemeyer (He/Him)
Professor, Department of Linguistics
University at Buffalo
Office: 642 Baldy Hall, UB North Campus
Mailing address: 609 Baldy Hall, Buffalo, NY 14260
Phone: (716) 645 0127
Fax: (716) 645 3825
Email: jb77 at buffalo.edu
Web: http://www.acsu.buffalo.edu/~jb77/
Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom (Meeting ID 585 520 2411; Passcode Hoorheh)
There’s A Crack In Everything - That’s How The Light Gets In
(Leonard Cohen)
--
From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Jocelyn Aznar via Lingtyp <lingtyp at listserv.linguistlist.org>
Date: Tuesday, February 4, 2025 at 05:02
To: lingtyp at listserv.linguistlist.org <lingtyp at listserv.linguistlist.org>
Subject: [Lingtyp] Concerns about U.S. policies and linguistic archives
Dear colleagues,
I know this list is primarily meant for discussing ideas and
observations related to linguistic typology, rather than politics.
However, current U.S. policies regarding scientific data have led me to
wonder whether these policies might affect the fields of linguistics and
humanities.
When I heard about data related to ecology and the environment being
discarded, I immediately worried the same could happen to linguistic
archives and datasets. But maybe it is just me. Dear colleagues working
in the US, what do you think? Could this happen as well to archives
related to linguistics and humanities?
I believe that if we address this issue proactively, we’ll be better
placed to preserve more data should the need arise. For instance, we
could check whether the existing infrastructure outside of the US (ELAR,
HumaNum/Ortolang, Pangloss, Paradisec, etc.) would be able to handle or
help to face such a crisis, or whether we should consider setting up
some sort of emergency server so that researchers can transfer data at
risk of being lost.
One possible strategy would be to prepare a brief manual (probably as a
webpage), after discussing with each institution of course, describing
which archives outside the U.S. could accept data from an archive from
the US, in which format, what kind of data would be accepted, etc. Then,
if needed, U.S.-based researchers could formulate a plan to safeguard
their data. By doing that, we could also identify gaps in current
coverage and, if necessary, establish an emergency archive or server to
fill those gaps.
Best regards,
Jocelyn Aznar
¹ I’m of course also concerned about data from other fields, though I
feel more competent discussing linguistic data. Still, if we build an
infrastructure for linguistic data from the U.S., it might be possible
to scale it up for other disciplines as well.
_______________________________________________
Lingtyp mailing list
Lingtyp at listserv.linguistlist.org
https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp