[Corpora-List] Helsinki Corpus of Swahili released for academic use

Kielipankki ling at csc.fi
Mon Oct 25 09:45:58 UTC 2004


Helsinki Corpus of Swahili released

The Helsinki Corpus of Swahili (HCS) has been released and is
available at the Language Bank of Finland for academic research
purposes on an interactive Linux server and via a web interface,
WWW-Lemmie. All usage requires a personal user account.

HCS is an annotated corpus of Standard Swahili text. It contains news
texts from several current Swahili newspapers as well as from the news
site of Deutsche Welle. It also contains extracts from a number of
books containing prose text, including fiction, educational and
scientific texts. The total size of the corpus is 12.5 million words in
25,000 XML documents. The XML format used is a derivative of TEI.

HCS has been annotated with SALAMA (Swahili Language Manager), a
multi-purpose language management environment, developed at the
University of Helsinki by Arvi Hurskainen, Professor of African
languages. The corpus contains information on such features as the
base form of the word (lemma), part of speech, and morphology,
including noun class affiliation and verb morphology. It also contains
the etymology of loan words and glosses in English.

For more information about the corpus (and a link to the web-based
application form), go to:

    http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en

Note that commercial use of the corpus, including the interactive
use of SALAMA, is possible, but must be negotiated separately with
Professor Hurskainen (ahurskai AT ling DOT helsinki DOT fi).


Best regards,

Mickel Grönroos and Manne Miettinen
The Language Bank of Finland
at the Finnish IT center for science CSC

Arvi Hurskainen
Professor of African languages, University of Helsinki

--
Kielipankki | Språkbanken i Finland | The Language Bank of Finland
The Finnish IT center for science CSC
PL 405 (Tekniikantie 15 a D), 02101 Espoo, Finland, +358-9-4572237


