[Corpora-List] Seeking for a free comparable corpus

Tue Jun 17 07:08:17 UTC 2014

Dear Hosein,

"comparable" corpora for different languages are available from the 
Leipzig Corpora Collection (LCC):

> The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The corpora are ready to use with the Corpus Browser. Moreover, all data are available as plain text and as MySQL database tables for various applications. [...]
> The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences etc. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed. [...]

http://corpora.informatik.uni-leipzig.de/download.html
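
To illustrate what "identical in format and similar in size" allows in practice, here is a minimal Python sketch that computes the same basic statistics for two sentence lists. It assumes each line of a plain-text sentence file has the form "id<TAB>sentence"; please check this against the files you actually download. The function names and the tiny sample data are made up for illustration and are not part of the LCC.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sentence text from lines of the assumed form 'id<TAB>sentence'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the first tab separates an id from the sentence text.
        _, sep, sentence = line.partition("\t")
        if sep:
            yield sentence


def corpus_stats(sentences: Iterable[str]) -> dict:
    """Count sentences, whitespace tokens, and distinct lower-cased tokens."""
    counts = Counter()
    n_sentences = 0
    for sentence in sentences:
        n_sentences += 1
        counts.update(sentence.lower().split())
    return {
        "sentences": n_sentences,
        "tokens": sum(counts.values()),
        "types": len(counts),
    }


# Tiny made-up samples standing in for two downloaded sentence files.
english = ["1\tThe cat sat on the mat.\n", "2\tThe dog barked.\n"]
german = ["1\tDie Katze saß auf der Matte.\n", "2\tDer Hund bellte.\n"]

stats_en = corpus_stats(read_sentences(english))
stats_de = corpus_stats(read_sentences(german))
print(stats_en)
print(stats_de)
```

For a real download, an open file object (opened with encoding="utf-8") can be passed to read_sentences in place of the sample lists. Whitespace tokenization is only a rough stand-in for a proper tokenizer.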

As far as I can tell, the underlying "definition" of "comparable" is 
roughly equivalent to Diana's statement:

On 14.06.2014 16:31, Diana Santos wrote:
> [...] Parallel means in a nutshell that you can put the units in direct correspondence (most of them), while comparable means that the selection criteria are the same, but you cannot pair the elements of the two corpora.  [...]

Note that the LCC is also presented in one of the chapters of the BUCC 
book that Reinhard just referred to:

> Thomas Eckart and Uwe Quasthoff: Statistical Corpus and Language Comparison on Comparable Corpora. In: BUCC – Building and Using Comparable Corpora, Springer, 2013

On 16.06.2014 23:16, Reinhard Rapp wrote:
> Let me only add that there are a number of definitions around of what constitutes a comparable corpus, and I think none of them is really authoritative.
>
> E.g. another one can be found in section 2.1 of the introductory chapter of the book “Building and Using Comparable Corpora”. This chapter can be freely downloaded from http://www.springer.com/computer/ai/book/978-3-642-20127-1 by clicking on “Download sample pages”.

Best regards

Frank

On 16.06.2014 09:28, hosein azarbonyad wrote:
> Dear Kristian,
>
> Building a comparable corpus has many challenges and problems, such as
> covering all domains, filtering out non-useful documents, ensuring
> high vocabulary coverage in both languages, and so on. Since building
> a comparable corpus is not my main research topic, I don't have enough
> knowledge or time to construct a high-quality corpus. I just want
> to use a comparable corpus in my research and, as I said, dealing with
> the challenges of constructing one is something that I cannot
> look into.
> Best Regards,
> Hosein Azarbonyad
>
> [...]
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Dipl.-Inf. Frank Binder
Justus-Liebig-Universität Gießen
Zentrum für Medien und Interaktivität
Ludwigstraße 34
D-35390 Gießen
Tel. +49 641 99-16383

http://www.uni-giessen.de/fbz/zmi/das-zmi/angehoerige/mitarbeiter-zmi/binder-frank 

-- 
Dipl.-Inf. Frank Binder ·  Tel.: +49(0)641 99-29056  ·  Fax: -29059
Justus-Liebig-Universität Gießen · FB 05 · Institut für Germanistik
Arbeitsbereich Angewandte Sprachwissenschaft und Computerlinguistik
Philosophikum I Büro D 406 · Otto-Behaghel-Str. 10 D · 35394 Gießen

Justus Liebig University Giessen
Applied and Computational Linguistics
Otto-Behaghel-Str. 10 D
35394 Giessen, Germany

siehe auch / see also:
http://www.uni-giessen.de/fbz/zmi/das-zmi/angehoerige/mitarbeiter-zmi/binder-frank
