34.2135, Software: CLASSLA web corpora of Croatian, Serbian and Slovenian

The LINGUIST List linguist at listserv.linguistlist.org
Thu Jul 6 05:05:02 UTC 2023


LINGUIST List: Vol-34-2135. Thu Jul 06 2023. ISSN: 1069 - 4875.

Subject: 34.2135, Software: CLASSLA web corpora of Croatian, Serbian and Slovenian

Moderators: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Daniel Swanson, Maria Lucero Guillen Puon, Zackary Leech, Lynzie Coburn, Natasha Singh, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: 23-Jun-2023
From: Taja Kuzman [taja.kuzman at ijs.si]
Subject: CLASSLA web corpora of Croatian, Serbian and Slovenian


The CLASSLA Knowledge centre for South Slavic languages
(https://www.clarin.si/info/k-centre/) is delighted to announce the
release of the pilot versions (v0.1) of the CLASSLA web corpora for
Croatian (2.3 billion words), Serbian (2.4 billion words) and
Slovenian (1.9 billion words). They are available for querying via the
CLARIN.SI concordancers (https://www.clarin.si/ske/#open). The main
features of the newly released corpora, aside from their large size
and recency (crawled in 2022), are their automatic enrichment with
genre information
(https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier)
and their linguistic processing with the improved CLASSLA-Stanza
annotation pipeline
(https://pypi.org/project/classla/). The pilot versions of these
corpora are intended to gather valuable user feedback, while the
official release (v1.0) of the three existing corpora, along with web
corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is
scheduled for later this year.

We warmly invite you to explore our corpora and to reach out to us at
helpdesk.classla at clarin.si with any ideas for improvements. You are
also invited to read our blog post on the use of CLASSLA web corpora
via the open CLARIN.SI concordancers:
https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/

If you are interested in South Slavic resources and technologies, we
also invite you to join the CLASSLA mailing list
(https://mailman.ijs.si/mailman/listinfo/classla) and to follow the
CLARIN.SI infrastructure on Twitter
(https://twitter.com/ClarinSlovenia).

Linguistic Field(s): Applied Linguistics
                     Computational Linguistics
                     Discourse Analysis
                     Language Acquisition
                     Text/Corpus Linguistics

Subject Language(s): Croatian (hrv)
                     Serbian (srp)
                     Slovenian (slv)

Language Family(ies): South Slavic


