[Corpora-List] Query about the (dual) language of web pages
Jose Joao Dias de Almeida
jj at di.uminho.pt
Mon Oct 15 16:09:08 UTC 2007
Dear corpora-list friends,
Santos Diana wrote:
> Yorick,
> ...
> I also seem to remember that some people in Braga (North of Portugal) did some experiments with grabbing parallel English-Portuguese pages from the Web, but since I know they are on the corpora-list they may answer directly :-)
Sorry for taking so long to answer...
We have done some work on extracting parallel corpora from the web, but
we cannot answer your questions:
all the methods we used retrieve only a (small) percentage of the
available bitexts (and we decided to reject many of those).
We programmed several strategies (parguess):
1) starting from a huge URL list (a site with a large number of
translations usually follows a coherent file naming scheme: index-pt.html
index-en.html; /en/f.pdf /pt/f.pdf; etc.)
2) starting from a set of files (looking for pairs that point to each
other with English/Portuguese links, or other topological approaches [see
Resnik's STRAND])
3) starting from the result of a topological query [see Resnik's STRAND]
4) several other strategies specific to particular file types (subtitles, po, etc.)
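As an illustration of strategy 1), here is a minimal sketch of pairing URLs by naming convention. It is not the parguess code; the function name candidate_pairs, the regular expression and the example URLs are assumptions made for this sketch, and it only handles a bare "en"/"pt" marker, whereas real sites use many more conventions.

```python
import re
from collections import defaultdict

# Assumed convention: a language marker is "en" or "pt" not glued to other letters,
# as in index-pt.html or /en/f.pdf. This pattern is illustrative, not exhaustive.
LANG_MARKER = re.compile(r"(?<![A-Za-z])(en|pt)(?![A-Za-z])")


def candidate_pairs(urls):
    """Return (english_url, portuguese_url) pairs that differ only by the marker."""
    groups = defaultdict(dict)
    for url in urls:
        for match in LANG_MARKER.finditer(url):
            # Replace the marker with a placeholder so both versions share one key.
            key = url[:match.start()] + "{lang}" + url[match.end():]
            groups[key].setdefault(match.group(1), url)
    return sorted(
        (found["en"], found["pt"])
        for found in groups.values()
        if "en" in found and "pt" in found
    )


if __name__ == "__main__":
    # Hypothetical URL list for demonstration only.
    urls = [
        "http://example.org/index-pt.html",
        "http://example.org/index-en.html",
        "http://example.org/en/f.pdf",
        "http://example.org/pt/f.pdf",
        "http://example.org/about.html",
    ]
    for en_url, pt_url in candidate_pairs(urls):
        print(en_url, pt_url)
```

The pairs produced this way are only candidates: nothing here checks that the two files are actually translations of each other.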
Using method 1), based on the analysis of huge URL lists, we got a very
big list of candidate pairs (much more than we could process), but
when we analyzed them we had to reject many because the translations
were too bad (many units missing, automatically translated files,
partial translations, etc.).
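A rough sketch of how such candidate pairs might be pre-filtered is given here. This message does not describe the actual rejection criteria; the segment-count and character-length ratio test, the function name plausible_pair and the 0.7 threshold are all assumptions for illustration. A check like this can flag pairs with many missing units or partial translations, but it cannot detect automatic translations.

```python
def plausible_pair(src_segments, tgt_segments, min_ratio=0.7):
    """Return True if both documents have similar segment counts and text lengths.

    src_segments and tgt_segments are lists of strings (e.g. sentences or
    paragraphs). min_ratio is an assumed threshold, not a value from parguess.
    """
    if not src_segments or not tgt_segments:
        return False
    n_src, n_tgt = len(src_segments), len(tgt_segments)
    segment_ratio = min(n_src, n_tgt) / max(n_src, n_tgt)

    chars_src = sum(len(s) for s in src_segments)
    chars_tgt = sum(len(s) for s in tgt_segments)
    if chars_src == 0 or chars_tgt == 0:
        return False
    char_ratio = min(chars_src, chars_tgt) / max(chars_src, chars_tgt)

    # Both ratios must be high: a partial translation fails at least one of them.
    return segment_ratio >= min_ratio and char_ratio >= min_ratio


if __name__ == "__main__":
    english = ["The house is big.", "I like coffee."]
    portuguese = ["A casa é grande.", "Eu gosto de café."]
    print(plausible_pair(english, portuguese))      # similar size on both sides
    print(plausible_pair(english * 4, portuguese))  # target much shorter than source
```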
We learned that, in order to have good parallel corpora, it is crucial
to choose the sources (sometimes use method 1 to find candidate sources;
after that, choose the sources and analyze them as much as possible).
See some details in:
(http://alfarrabio.di.uminho.pt/~albie/publications/parguess.sepln.pdf)
and
(http://alfarrabio.di.uminho.pt/~albie/publications/APL2k2.Parguess.pdf
this one is in Portuguese)
Warm regards from Braga,
JJoao