<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Hosein,<br>
    <br>
    It's not entirely clear to me what kind of corpus structure you are
    looking for. You can read about downloading portions (or the
    entirety) of the different language Wikipedias here:
    <a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">http://en.wikipedia.org/wiki/Wikipedia:Database_download</a><br>
    <br>
    Since they also provide the interwiki language links separately, I am
    sure it is possible, and quite straightforward, to compose a corpus
    with a structure of your own liking.<br>
    <br>
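    To sketch what I mean (only an illustration, not a tested pipeline):
    assuming the language links have already been extracted from a dump
    into (source title, language code, target title) tuples, something
    like the following Python would keep the articles that exist in every
    language you need. The function name and the sample records are made
    up for the example.<br>
    <pre>
```python
from collections import defaultdict


def align_by_language_links(links, target_langs):
    """Group language links into per-article alignment records.

    links: iterable of (source_title, lang_code, target_title) tuples,
           assumed to be extracted beforehand from a Wikipedia dump.
    target_langs: language codes that every aligned record must cover.
    """
    by_source = defaultdict(dict)
    for source_title, lang, target_title in links:
        # Skip empty titles and languages we do not care about.
        if lang in target_langs and target_title:
            by_source[source_title][lang] = target_title

    wanted = set(target_langs)
    # Keep only articles with a counterpart in every requested language.
    return {
        title: dict(sorted(targets.items()))
        for title, targets in by_source.items()
        if wanted.issubset(targets)
    }


# Toy records, invented for illustration only.
sample_links = [
    ("Tennis", "de", "Tennis"),
    ("Tennis", "et", "Tennis"),
    ("Corpus linguistics", "de", "Korpuslinguistik"),
]

aligned = align_by_language_links(sample_links, ["de", "et"])
print(aligned)
```
</pre>
    From the aligned titles you could then pull the article texts in
    whatever layout suits your task.<br>
    <br>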
    A small note, though. Since all Wikipedia texts are licensed under
    copyleft licenses, any derived corpus must also carry the same kind
    of copyleft license. I personally find this the right way to
    advance science.<br>
    <br>
    Best wishes<br>
    Kristian K<br>
    <br>
    <div class="moz-cite-prefix">15.06.2014 09:07, hosein azarbonyad
      wrote:<br>
    </div>
    <blockquote
      cite="mid:1402812423.39085.YahooMailNeo@web141004.mail.bf1.yahoo.com"
      type="cite">
      <div style="color:#000; background-color:#fff;
        font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial,
        Lucida Grande, sans-serif;font-size:10pt">
        <div><span>As I recall, many papers have used Wikipedia
            articles as comparable corpora in CLIR, because CLIR
            does not need documents that are exact
            translations of each other. For our task, a collection of
            topically related, aligned documents is enough. However, I
            couldn't find any free comparable corpus extracted
            from Wikipedia. Does one exist? I know there are some
            comparable corpora in the CLEF datasets, but they aren't
            free. <br>
          </span></div>
        <div> </div>
        <div>Best Regards,<br>
          Hosein Azarbonyad</div>
        <div class="qtdSeparateBR"><br>
          <br>
        </div>
        <div style="display: block;" class="yahoo_quoted">
          <div style="font-family: HelveticaNeue, Helvetica Neue,
            Helvetica, Arial, Lucida Grande, sans-serif; font-size:
            10pt;">
            <div style="font-family: HelveticaNeue, Helvetica Neue,
              Helvetica, Arial, Lucida Grande, sans-serif; font-size:
              12pt;">
              <div dir="ltr"> <font face="Arial" size="2"> On Sunday,
                  June 15, 2014 8:42 AM, Joel Nothman
                  <a class="moz-txt-link-rfc2396E" href="mailto:joel@it.usyd.edu.au">&lt;joel@it.usyd.edu.au&gt;</a> wrote:<br>
                </font> </div>
              <br>
              <br>
              <div class="y_msg_container">
                <div id="yiv2777319177">
                  <div>
                    <div dir="ltr">Perhaps the most characteristic
                      feature of Wikipedia is its long tail, and the
                      apparently different features (and editorial
                      behaviour?) of the tail and head. What is true of
                      the most important/popular articles may rarely be
                      true of the majority (it's unclear which we care
                      about in this case). For example, <a
                        moz-do-not-send="true" rel="nofollow"
                        shape="rect" target="_blank"
                        href="http://downloads.schwa.org/pubs/pdf/aij10wikiner.pdf">our
                        work in entity type classification</a> has
                      compared training and testing on a random or a
                      "popular" sample, each of about 2000 articles
                      altogether. A model trained on the random sample
                      achieves 92% F1 on popular articles, but the
                      reverse (training on popular, testing on random)
                      yields only 75%, although training and testing on
                      random articles reaches 90% F1. This
                      is mostly indicative of type distributions, but no
                      doubt editing patterns face similar discrepancies.
                      <div>
                        <br clear="none">
                      </div>
                      <div>Therefore I might guess that the more
                        universally popular articles like [[Tennis]] are
                        going to appear different, while the plethora of
                        more minor entries (e.g. bands, corporations)
                        are likely to have clearer parallels.
                        <div>
                          <br clear="none">
                        </div>
                        <div>Additionally, there will be divergence
                          after translation (notably restructuring in
                          the most popular articles of actively edited
                          Wikipedias) which makes cognates (may I?) hard
                          to identify from the current pages. Thus
                          "clicking a random sample of languages for the
                          page on tennis" may be made more precise if
                          one compares a foundational edit, or perhaps
                          the historical edit that introduced the
                          largest portion of text to a page, to the
                          state of the English Wikipedia equivalent <i>at
                            that time</i>. However, the example of <a
                            moz-do-not-send="true" rel="nofollow"
                            shape="rect" target="_blank"
href="http://ja.wikipedia.org/w/index.php?title=%E3%83%86%E3%83%8B%E3%82%B9&oldid=356563">Japanese
                            tennis</a> in 2004 compared to <a
                            moz-do-not-send="true" rel="nofollow"
                            shape="rect" target="_blank"
href="http://en.wikipedia.org/w/index.php?title=Tennis&oldid=2702021">English</a>
                          is not very suggestive.
                          <div>
                            <br clear="none">
                          </div>
                          <div>And I recall <a moz-do-not-send="true"
                              rel="nofollow" shape="rect"
                              target="_blank"
                              href="http://storm.cis.fordham.edu/%7Efilatova/publications.html">Elena
                              Filatova</a> did some pioneering work in
                            computationally exploiting parallels and
                            differences in multilingual Wikipedia.<br
                              clear="none">
                          </div>
                        </div>
                        <div><br clear="none">
                        </div>
                        <div>Cheers,</div>
                        <div><br clear="none">
                        </div>
                        <div class="yiv2777319177gmail_extra">Joel
                          Nothman</div>
                        <div class="yiv2777319177gmail_extra">School of
                          IT</div>
                        <div class="yiv2777319177gmail_extra">University
                          of Sydney<br clear="none">
                          <br clear="none">
                          <div class="yiv2777319177yqt5046587770"
                            id="yiv2777319177yqt34271">
                            <div class="yiv2777319177gmail_quote">
                              On 15 June 2014 11:58, Francis Bond <span
                                dir="ltr">&lt;<a moz-do-not-send="true"
                                  rel="nofollow" shape="rect"
                                  ymailto="mailto:bond@ieee.org"
                                  target="_blank"
                                  href="mailto:bond@ieee.org">bond@ieee.org</a>&gt;</span>
                              wrote:<br clear="none">
                              <blockquote
                                class="yiv2777319177gmail_quote"
                                style="margin:0 0 0 .8ex;border-left:1px
                                #ccc solid;padding-left:1ex;">
                                G'day.<br clear="none">
                                <div class="yiv2777319177"><br
                                    clear="none">
                                  > No, articles from Wikipedia in
                                  different languages are NOT a
                                  comparable<br clear="none">
                                  > corpus, for many reasons<br
                                    clear="none">
                                  ><br clear="none">
                                </div>
                                <div class="yiv2777319177">> First,
                                  most of the time they are a (more or
                                  less free) translation of a<br
                                    clear="none">
                                  > master/initial one.<br
                                    clear="none">
                                  <br clear="none">
                                </div>
                                Do you have a citation for this?   As
                                far as I know it is not<br clear="none">
                                generally true; pages are written pretty
                                much entirely independently<br
                                  clear="none">
                                (at least for the English and Japanese
                                Wikipedias which I am<br clear="none">
                                experienced with).  I also clicked a
                                random sample of languages for<br
                                  clear="none">
                                the page on tennis, and they are all
                                very differently structured.<br
                                  clear="none">
                                <br clear="none">
                                I seem to recall a shared task on
                                aligning sentences in Wikipedia<br
                                  clear="none">
                                articles that found them not at all
                                similar, but I am afraid I can't<br
                                  clear="none">
                                find the paper: does anyone else recall
                                it?<br clear="none">
                                <span class="yiv2777319177HOEnZb"><font
                                    color="#888888"><br clear="none">
                                    --<br clear="none">
                                    Francis Bond &lt;<a
                                      moz-do-not-send="true"
                                      rel="nofollow" shape="rect"
                                      target="_blank"
                                      href="http://www3.ntu.edu.sg/home/fcbond/">http://www3.ntu.edu.sg/home/fcbond/</a>&gt;<br
                                      clear="none">
                                    Division of Linguistics and
                                    Multilingual Studies<br clear="none">
                                    Nanyang Technological University<br
                                      clear="none">
                                  </font></span>
                                <div class="yiv2777319177HOEnZb">
                                  <div class="yiv2777319177h5"><br
                                      clear="none">
_______________________________________________<br clear="none">
                                    UNSUBSCRIBE from this page: <a
                                      moz-do-not-send="true"
                                      rel="nofollow" shape="rect"
                                      target="_blank"
                                      href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a><br
                                      clear="none">
                                    Corpora mailing list<br clear="none">
                                    <a moz-do-not-send="true"
                                      rel="nofollow" shape="rect"
                                      ymailto="mailto:Corpora@uib.no"
                                      target="_blank"
                                      href="mailto:Corpora@uib.no">Corpora@uib.no</a><br
                                      clear="none">
                                    <a moz-do-not-send="true"
                                      rel="nofollow" shape="rect"
                                      target="_blank"
                                      href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a><br
                                      clear="none">
                                  </div>
                                </div>
                              </blockquote>
                            </div>
                          </div>
                          <br clear="none">
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
                <br>
                <br>
                <br>
              </div>
            </div>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
    <br>
  </body>
</html>