[Corpora-List] Query about the (dual) language  of web pages
    lec3jrw at leeds.ac.uk 
       
    Mon Oct 15 18:32:37 UTC 2007
    
    
  
Hello,
You say that you had to discard many of the pairs because of "bad" translations.
Would it nonetheless be fair to say that, using your technique, the majority of
the page pairs returned did represent dual language page pairs (i.e. attempted
translations, however bad)? Or were there too many "red herrings"?
Kind regards,
Justin Washtell
Quoting Jose Joao Dias de Almeida <jj at di.uminho.pt> on Mon 15 Oct 2007 05:09:08
PM BST:
> Dear corpora-list friends,
>
> Santos Diana wrote:
> > Yorick,
> > ...
> > I also seem to remember that some people in Braga (North of Portugal) did
> > some experiments with grabbing parallel English-Portuguese pages from the
> > Web, but since I know they are on the corpora-list they may answer directly
> > :-)
> Sorry to take so long to answer...
>
> We have done some work on extracting parallel corpora from the web, but
> we cannot answer your questions:
>      all the methods we used retrieve only a (small) percentage of the
> available bitexts (and we decided to reject many of them).
>
>     We programmed several strategies (parguess):
>       1) starting from a huge URL list (for a large number of translations
> you would choose a coherent file-naming system: index-pt.html and
> index-en.html; /en/f.pdf and /pt/f.pdf; etc.)
>       2) starting from a set of files (looking for pairs that point to each
> other with English/Portuguese links, or other topological approaches [see
> Resnik's STRAND])
>       3) starting from the result of a topological query [see Resnik's STRAND]
>       4) several other strategies specific to file types (subtitles, po, etc.)
>
>
>     Using method 1), based on the analysis of huge URL lists, we got a very
> large list of candidate pairs (much more than we could process), but
> when we analyzed them we had to reject many because the translations
> were too poor (many units missing, sometimes automatically translated
> files, partial translations, etc.).
>
>
>     We learned that in order to have good parallel corpora it is crucial
> to choose the sources (sometimes use method 1 to find source candidates;
> after that, choose the sources and analyze them as much as possible).
>
> See some details in:
>   http://alfarrabio.di.uminho.pt/~albie/publications/parguess.sepln.pdf
> and
>   http://alfarrabio.di.uminho.pt/~albie/publications/APL2k2.Parguess.pdf
>   (this one is in Portuguese)
>
> Best wishes from Braga,
> JJoao
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
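
Strategy 1) in the quoted message (pairing URLs that follow a coherent naming scheme) might be sketched roughly as below. This is only an illustration, not the parguess code: the function name candidate_pairs, the LANG_TOKEN pattern and the example.org URLs are all assumptions, and only the index-pt.html / index-en.html and /en/ vs /pt/ conventions come from the message itself.

```python
import re
from collections import defaultdict

# Assumed marker pattern: "en" or "pt" standing alone between non-alphanumeric
# characters, as in index-pt.html or /en/f.pdf. Real sites use many other schemes.
LANG_TOKEN = re.compile(r"(?<![A-Za-z0-9])(en|pt)(?![A-Za-z0-9])")


def candidate_pairs(urls):
    """Return sorted (english_url, portuguese_url) pairs whose URLs differ
    only in a single en/pt language token."""
    buckets = defaultdict(dict)
    for url in urls:
        for match in LANG_TOKEN.finditer(url):
            # Replace the language token with a placeholder to get a shared key.
            key = url[:match.start()] + "{LANG}" + url[match.end():]
            buckets[key].setdefault(match.group(1), url)
    pairs = {
        (langs["en"], langs["pt"])
        for langs in buckets.values()
        if "en" in langs and "pt" in langs
    }
    return sorted(pairs)


if __name__ == "__main__":
    urls = [
        "http://example.org/index-pt.html",
        "http://example.org/index-en.html",
        "http://example.org/en/f.pdf",
        "http://example.org/pt/f.pdf",
        "http://example.org/about.html",
    ]
    for en_url, pt_url in candidate_pairs(urls):
        print(en_url, pt_url)
```

A list produced this way contains candidates only. As the quoted message notes, many such pairs turn out to be partial, machine-translated or otherwise poor translations, so each pair would still need to be checked before it goes into a corpus.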