[Corpora-List] Seeking for a free comparable corpus

Motaz Saad motaz.saad at inria.fr
Tue Jun 17 08:06:43 UTC 2014


Hi, 


You may find what you need at https://sites.google.com/site/motazsite/arabic/comparable-corpora 

These are free comparable corpora collected from Wikipedia and Euro-News in English, French, and Arabic.

best regards,
Motaz 

----- Original Message -----
| From: "Frank Binder" <Frank.Binder at germanistik.uni-giessen.de>
| To: "hosein azarbonyad" <hosein.azarbonyad at yahoo.com>, corpora at uib.no
| Sent: Tuesday, June 17, 2014 9:20:49 AM
| Subject: Re: [Corpora-List] Seeking for a free comparable corpus
| 
| Dear Hosein,
| 
| beware, though, that the LCC contains corpora of sentences, not of
| documents. So they might be useful only for certain kinds of tasks.
| 
| Regards
| Frank
| 
| 
| On 17.06.2014 09:08, Frank Binder wrote:
| > Dear Hosein,
| >
| > "comparable" corpora for different languages are available from the
| > Leipzig Corpora Collection (LCC):
| >
| >> The Leipzig Corpora Collection presents corpora in different languages
| >> using the same format and comparable sources. The corpora are ready to
| >> use with the Corpus Browser. Moreover, all data are available as plain
| >> text and as MySQL database tables for various applications. [...]
| >> The corpora are identical in format and similar in size and content.
| >> They contain randomly selected sentences in the language of the corpus
| >> and are available in sizes of 100,000 sentences, 300,000 sentences, 1
| >> million sentences etc. The sources are either newspaper texts or
| >> texts randomly collected from the web. The texts are split into
| >> sentences. Non-sentences and foreign language material were removed. [...]
| >
| > http://corpora.informatik.uni-leipzig.de/download.html
| >
| > As far as I can tell, the underlying "definition" of "comparable" is
| > roughly equal to Diana's statement
| >
| > On 14.06.2014 16:31, Diana Santos wrote:
| >> [...] Parallel means in a nutshell that you can put the units in
| >> direct correspondence (most of them), while comparable means that the
| >> selection criteria are the same, but you cannot pair the elements of
| >> the two corpora.  [...]
| >
| > Note that the LCC is also presented in one of the chapters of the BUCC
| > book that Reinhard just referred to:
| >
| >> Thomas Eckart and Uwe Quasthoff: Statistical Corpus and Language
| >> Comparison on Comparable Corpora. In: BUCC – Building and Using
| >> Comparable Corpora, Springer, 2013
| >
| > On 16.06.2014 23:16, Reinhard Rapp wrote:
| >> Let me only add that there are a number of definitions around of what
| >> constitutes a comparable corpus, and I think none of them is really
| >> authoritative.
| >>
| >> E.g. another one can be found in section 2.1 of the introductory
| >> chapter of the book “Building and Using Comparable Corpora”. This
| >> chapter can be freely downloaded from
| >> http://www.springer.com/computer/ai/book/978-3-642-20127-1 by clicking
| >> on “Download sample pages”.
| >
| > Best regards
| >
| > Frank
| >
| > On 16.06.2014 09:28, hosein azarbonyad wrote:
| >> Dear Kristian,
| >>
| >> Building a comparable corpus poses many challenges, such as
| >> covering all domains, filtering out unhelpful documents, and ensuring
| >> high vocabulary coverage in both languages. Since building a
| >> comparable corpus is not my main research, I have neither the
| >> knowledge nor the time to construct a high-quality one. I just want
| >> to use a comparable corpus in my research and, as I said, the
| >> challenges of constructing one are something I cannot
| >> investigate myself.
| >> Best Regards,
| >> Hosein Azarbonyad
| >>
| >> [...]
| >>
| >> _______________________________________________
| >> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
| >> Corpora mailing list
| >> Corpora at uib.no
| >> http://mailman.uib.no/listinfo/corpora
| >>
| >
| >
| 
| 
| --
| Dipl.-Inf. Frank Binder ·  Tel.: +49(0)641 99-29056  ·  Fax: -29059
| Justus-Liebig-Universität Gießen · FB 05 · Institut für Germanistik
| Arbeitsbereich Angewandte Sprachwissenschaft und Computerlinguistik
| Philosophikum I Büro D 406 · Otto-Behaghel-Str. 10 D · 35394 Gießen
| 
| Justus Liebig University Giessen
| Applied and Computational Linguistics
| Otto-Behaghel-Str. 10 D
| 35394 Giessen, Germany
| 
| see also:
| http://www.uni-giessen.de/fbz/zmi/das-zmi/angehoerige/mitarbeiter-zmi/binder-frank
| 
|
