<div dir="ltr"><div><div><div><div><div><div><div>Hi Darren<br>No, articles from Wikipedia in different languages are NOT a comparable corpus, for many reasons.<br><br></div>First, most of the time they are a (more or less free) translation of a master/initial one.<br>
</div>Second, they are about the same (narrow) subject, while a comparable corpus would be about the same theme but cover many different subjects. Examples of comparable corpora would be original articles in two languages about violations of human rights, about fashion, or about complaints about health system facilities.<br>
<br></div>If you are interested in CLIR, you could try the CLEF collections, which were created precisely for this.<br><br></div>Separately, a parallel corpus is not defined in terms of SENTENCE alignment; the alignment unit is a parameter. So a Wikipedia collection such as the one you suggest is a parallel corpus where the unit is the Wikipedia article, not the sentence.<br>
<br></div>Parallel means, in a nutshell, that you can put the units (most of them) in direct correspondence, while comparable means that the selection criteria are the same but you cannot pair the elements of the two corpora.<br>
<br></div>I hope this helps.<br>Best,<br></div>Diana<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-06-14 16:15 GMT+02:00 Darren Cook <span dir="ltr">&lt;<a href="mailto:darren@dcook.org" target="_blank">darren@dcook.org</a>&gt;</span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">> I'm working on Cross Language Information Retrieval based on<br>
> comparable corpora. In order to test my approach, I need a free<br>
> comparable corpus between English language and an European language.<br>
<br>
I was just trying to understand the difference between "parallel corpus"<br>
and "comparable corpus". Am I correct in thinking that if an article is<br>
translated (by a professional human translator, or a machine) from one<br>
language to another, such that there is a sentence-level correspondence,<br>
then it is a parallel corpus. Whereas a comparable corpus is one where<br>
the two articles were written about the same subject, but neither is a<br>
translation of the other, and mostly the same knowledge is covered, but<br>
a sentence-level mapping would not exist?<br>
<br>
If so, Wikipedia sounds like an ideal source.<br>
E.g.<br>
<a href="http://en.wikipedia.org/wiki/Paris" target="_blank">http://en.wikipedia.org/wiki/Paris</a><br>
<a href="http://fr.wikipedia.org/wiki/Paris" target="_blank">http://fr.wikipedia.org/wiki/Paris</a><br>
<br>
<a href="http://en.wikipedia.org/wiki/Association_football" target="_blank">http://en.wikipedia.org/wiki/Association_football</a><br>
<a href="http://fr.wikipedia.org/wiki/Football" target="_blank">http://fr.wikipedia.org/wiki/Football</a><br>
<br>
etc.<br>
<span class="HOEnZb"><font color="#888888"><br>
Darren<br>
<br>
<br>
--<br>
Darren Cook, Software Researcher/Developer<br>
My new book: Data Push Apps with HTML5 SSE<br>
Published by O'Reilly: (ask me for a discount code!)<br>
<a href="http://shop.oreilly.com/product/0636920030928.do" target="_blank">http://shop.oreilly.com/product/0636920030928.do</a><br>
Also on Amazon and at all good booksellers!<br>
<br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
</font></span></blockquote></div><br></div>