37.843, FYI: CLASSLA-web 2.0 Web Corpora for South Slavic Languages

The LINGUIST List linguist at listserv.linguistlist.org
Mon Mar 2 20:05:02 UTC 2026


LINGUIST List: Vol-37-843. Mon Mar 02 2026. ISSN: 1069 - 4875.

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 02-Mar-2026
From: Taja Kuzman Pungeršek [taja.kuzman at ijs.si]
Subject: CLASSLA-web 2.0 Web Corpora for South Slavic Languages


We are happy to announce that we have released the second version of
the South Slavic CLASSLA-web corpora. The corpus collection contains
approximately 38 million texts and 17 billion words, collected from
the web in 2024, and covers the full South Slavic language group:
Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and
Slovenian. Compared to CLASSLA-web 1.0, the new web corpora are
significantly expanded and largely consist of new texts. The corpora
are linguistically annotated, automatically classified by genre, and
enriched with topic labels.

The web corpus collection is intended for a wide range of uses,
including corpus linguistics, lexicography, and other linguistic
research, as well as for natural language processing tasks such as
training and evaluating language models and creating genre- or
topic-specific datasets.

A detailed description of the resource can be found in the
accompanying paper (https://doi.org/10.48550/arXiv.2601.11170).
Further information on both the 1.0 and 2.0 versions of CLASSLA-web,
including details on corpus construction, additional resources, a
video describing the workflow, and citation guidelines, is available
on the CLASSLA-web website: https://clarinsi.github.io/classla-web/

If you are interested in language resources and technologies for South
Slavic languages, we invite you to browse the CLASSLA-web corpora via
the CLARIN.SI concordancers (https://www.clarin.si/ske/#open) or
download them under a CC0 license from the CLARIN.SI repository:
http://hdl.handle.net/11356/2079

Best wishes,
CLASSLA-web authors: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel,
and Nikola Ljubešić, supported by CLARIN.SI
(https://www.clarin.si/info/about/) and CLASSLA
(https://www.clarin.si/info/k-centre/)

Linguistic Field(s): Applied Linguistics
                     Computational Linguistics
                     Text/Corpus Linguistics

Subject Language(s): Bulgarian (bul)
                     Croatian (hrv)
                     Macedonian (mkd)
                     Serbian (srp)
                     Slovenian (slv)

Language Family(ies): Slavic
                      South Slavic



