[Corpora-List] Seeking for a free comparable corpus

Frank Binder Frank.Binder at germanistik.uni-giessen.de
Tue Jun 17 07:20:49 UTC 2014


Dear Hosein,

Beware, though, that the LCC contains corpora of sentences, not of
documents, so they might be useful only for certain kinds of tasks.
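
To make the caveat concrete, here is a minimal Python sketch. It assumes that a plain-text download holds one sentence per line in the form "ID<TAB>sentence"; that layout, the helper name read_sentences and the sample data are my assumptions for illustration, not taken from the LCC documentation. It shows that what you get back is a flat list of sentences with no document boundaries to group them by.

```python
import io


def read_sentences(lines):
    """Yield (sentence_id, sentence) pairs from tab-separated lines.

    Assumed layout (hypothetical): "<numeric id>\t<sentence text>".
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        sent_id, _, text = line.partition("\t")
        yield int(sent_id), text


# Invented sample standing in for a downloaded sentence file.
sample = io.StringIO("1\tDies ist ein Satz.\n2\tNoch ein Satz.\n")
sentences = list(read_sentences(sample))

# Each entry is an isolated sentence; nothing records which document
# it came from or which sentences were originally adjacent.
for sent_id, text in sentences:
    print(sent_id, text)
```

Anything that needs document-level context (topic alignment of whole articles, for example) cannot be recovered from data in this shape.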

Regards
Frank


On 17.06.2014 09:08, Frank Binder wrote:
> Dear Hosein,
>
> "comparable" corpora for different languages are available from the
> Leipzig Corpora Collection (LCC):
>
>> The Leipzig Corpora Collection presents corpora in different languages
>> using the same format and comparable sources. The corpora are ready to
>> use with the Corpus Browser. Moreover, all data are available as plain
>> text and as MySQL database tables for various applications. [...]
>> The corpora are identical in format and similar in size and content.
>> They contain randomly selected sentences in the language of the corpus
>> and are available in sizes of 100,000 sentences, 300,000 sentences, 1
>> million sentences, etc. The sources are either newspaper texts or
>> texts randomly collected from the web. The texts are split into
>> sentences. Non-sentences and foreign-language material were removed. [...]
>
> http://corpora.informatik.uni-leipzig.de/download.html
>
> As far as I can tell, the underlying "definition" of "comparable" is
> roughly equivalent to Diana's statement:
>
> On 14.06.2014 16:31, Diana Santos wrote:
>> [...] Parallel means, in a nutshell, that you can put the units in
>> direct correspondence (most of them), while comparable means that the
>> selection criteria are the same, but you cannot pair the elements of
>> the two corpora.  [...]
>
> Note that the LCC is also presented in one of the chapters of the BUCC
> book that Reinhard just referred to:
>
>> Thomas Eckart and Uwe Quasthoff: Statistical Corpus and Language
>> Comparison on Comparable Corpora. In: BUCC – Building and Using
>> Comparable Corpora, Springer, 2013
>
> On 16.06.2014 23:16, Reinhard Rapp wrote:
>> Let me only add that there are a number of definitions around of what
>> constitutes a comparable corpus, and I think none of them is really
>> authoritative.
>>
>> E.g. another one can be found in section 2.1 of the introductory
>> chapter of the book “Building and Using Comparable Corpora”. This
>> chapter can be freely downloaded from
>> http://www.springer.com/computer/ai/book/978-3-642-20127-1 by clicking
>> on “Download sample pages”.
>
> Best regards
>
> Frank
>
> On 16.06.2014 09:28, hosein azarbonyad wrote:
>> Dear Kristian,
>>
>> Building a comparable corpus poses many challenges, such as covering
>> all domains, filtering out documents that are not useful, and ensuring
>> high vocabulary coverage in both languages. Since building a comparable
>> corpus is not my main research topic, I do not have enough knowledge or
>> time to construct a high-quality corpus. I just want to use a
>> comparable corpus in my research and, as I said, the challenges of
>> constructing one are something I cannot investigate myself.
>> Best Regards,
>> Hosein Azarbonyad
>>
>> [...]
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
>


-- 
Dipl.-Inf. Frank Binder ·  Tel.: +49(0)641 99-29056  ·  Fax: -29059
Justus-Liebig-Universität Gießen · FB 05 · Institut für Germanistik
Arbeitsbereich Angewandte Sprachwissenschaft und Computerlinguistik
Philosophikum I Büro D 406 · Otto-Behaghel-Str. 10 D · 35394 Gießen

Justus Liebig University Giessen
Applied and Computational Linguistics
Otto-Behaghel-Str. 10 D
35394 Giessen, Germany

see also:
http://www.uni-giessen.de/fbz/zmi/das-zmi/angehoerige/mitarbeiter-zmi/binder-frank
