[Corpora-List] the IPI PAN Corpus of Polish

Adam Przepiorkowski adamp at ipipan.waw.pl
Wed Mar 22 22:31:33 UTC 2006


The 2nd edition of the IPI PAN Corpus of Polish, developed
at the Institute of Computer Science of the Polish Academy
of Sciences (PAS), is available on the web pages of:

- the Institute of Computer Science PAS: 
  http://korpus.pl/en/
- the Institute of Polish Language PAS: 
  http://corpus.ijp-pan.krakow.pl/en/

To the best of our knowledge, this is currently the largest
searchable morphosyntactically annotated corpus of Polish
available to the public.

The whole corpus consists of over 250 million segments
(about 200 million orthographic words).  It is not balanced,
but a balanced sample of over 30 million segments is also
available.  Both corpora can be searched directly at the
above addresses (do read the query syntax cheatsheet at
http://korpus.pl/en/cheatsheet/index.html) or downloaded in
binary form for use with the standalone version of the
corpus search engine Poliqarp (announced separately on the
'corpora' list).  Note that the standalone Poliqarp offers
much greater functionality than the web interface (e.g., it
shows metadata and presents more results).

Best regards,

Adam P.

-- 
Adam Przepiorkowski
http://nlp.ipipan.waw.pl/ ----- Linguistic Engineering Group
http://korpus.pl/ ------------- the IPI PAN Corpus of Polish


