[Corpora-List] Query about the (dual) language of web pages

Jose Joao Dias de Almeida jj at di.uminho.pt
Mon Oct 15 16:09:08 UTC 2007


Dear corpora-list friends,

Santos Diana wrote:
> Yorick, 
> ...
> I also seem to remember that some people in Braga (North of Portugal) did some experiments with grabbing parallel English-Portuguese pages from the Web, but since I know they are on the corpora-list they may answer directly :-)
Sorry to take so long to answer...

We have done some work on extracting parallel corpora from the web, but 
we cannot answer your questions:
     all the methods we used retrieve only a (small) percentage of the 
available bitexts (and we decided to reject many of them).

    We programmed several strategies (parguess):
      1) starting from a huge URL list (with a large number of translations, 
one would choose a coherent file naming system: index-pt.html / 
index-en.html; /en/f.pdf and /pt/f.pdf; etc.)
      2) starting from a set of files (looking for pairs that point to each 
other with English/Portuguese links, or other topological approaches [see 
Resnik's STRAND])
      3) starting from the result of a topological query [see Resnik's STRAND]
      4) several other strategies specific to file types (subtitles, po, etc.)
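
    To make strategy 1) concrete, here is a minimal sketch in Python. It is 
not the parguess code: the marker pattern, the function names (url_key, 
candidate_pairs) and the example URLs are assumptions made only for 
illustration. It pairs URLs whose paths differ only in an "en"/"pt" marker.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed language markers: "en" or "pt" delimited by / _ . or -
# Real sites use many more conventions (e.g. "eng", "por", "lang=pt").
LANG_MARKER = re.compile(r"(?<=[/_.\-])(en|pt)(?=[/_.\-])")


def url_key(url):
    """Return (language, key) or None if the path has no language marker.

    The key is host + path with the first marker replaced by "*", so that
    two language versions of the same page share one key. Only the path is
    searched, so a ".pt" country domain is not mistaken for a marker.
    """
    parts = urlsplit(url)
    match = LANG_MARKER.search(parts.path)
    if match is None:
        return None
    path = parts.path[:match.start()] + "*" + parts.path[match.end():]
    return match.group(1), parts.netloc + path


def candidate_pairs(urls):
    """Group URLs by key and return sorted (english_url, portuguese_url) pairs."""
    by_key = defaultdict(dict)
    for url in urls:
        parsed = url_key(url)
        if parsed is not None:
            lang, key = parsed
            by_key[key].setdefault(lang, url)  # keep the first URL seen per language
    return sorted(
        (langs["en"], langs["pt"])
        for langs in by_key.values()
        if "en" in langs and "pt" in langs
    )
```

    The output is only a list of candidates; whether each pair really is a 
translation still has to be checked.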


    Using method 1), based on the analysis of huge URL lists, we got a very 
big list of candidate pairs (many more than we could process), but 
when we analyzed them we had to reject many because the translations 
were too poor (many units missing; sometimes automatic translations, 
partial translations, etc.)
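
    As an illustration of how candidate pairs with many missing units might 
be screened out, here is a crude sketch in Python. It is an assumption, not 
the filter we used: the function names and the thresholds (max_ratio, 
min_chars) are hypothetical, and comparing lengths and paragraph counts 
cannot detect automatic translations.

```python
import re


def count_paragraphs(text):
    """Count non-empty blocks separated by blank lines."""
    return sum(1 for block in re.split(r"\n\s*\n", text) if block.strip())


def looks_parallel(text_a, text_b, max_ratio=1.5, min_chars=200):
    """Heuristic check that two plain texts could be full translations.

    Rejects a pair when either text is very short, or when the character
    lengths or the paragraph counts differ by more than max_ratio.
    The default thresholds are illustrative guesses, not tuned values.
    """
    len_a, len_b = len(text_a.strip()), len(text_b.strip())
    if min(len_a, len_b) < max(min_chars, 1):
        return False
    if max(len_a, len_b) / min(len_a, len_b) > max_ratio:
        return False
    paras_a, paras_b = count_paragraphs(text_a), count_paragraphs(text_b)
    return max(paras_a, paras_b) / min(paras_a, paras_b) <= max_ratio
```

    A pair that passes this test is still only a candidate and needs closer 
analysis (alignment, manual inspection of the source).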


    We learned that in order to build good parallel corpora it is crucial 
to choose the sources (sometimes use method 1 to find candidate sources; 
after that, choose the sources and analyze them as thoroughly as possible).

See some details in:
  (http://alfarrabio.di.uminho.pt/~albie/publications/parguess.sepln.pdf)
and
  (http://alfarrabio.di.uminho.pt/~albie/publications/APL2k2.Parguess.pdf 
   this one in Portuguese)

Um abraço from Braga,
JJoao

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


