[Corpora-List] Query about the (dual) language of web pages

Tue Oct 16 10:15:07 UTC 2007

Justin,
   I simply discarded them because my intention was to have "good" parallel 
corpora. Many of the discarded ones are in fact dual-language page pairs.

(But for our project we saw that fuzzy translations were very bad for the 
final results. That is why I said that "we cannot answer your 
questions": our early decisions went in a different direction.)

   Anyway, I believe that methods based on huge URL lists can give you
a good(?) way of comparing the cardinalities of dual-language page pairs.
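
To make the URL-list idea concrete, here is a minimal sketch in Python. It is
not parguess itself: the marker list ("pt", "en"), the set of delimiters and
the function name candidate_pairs are assumptions made only for illustration.
It groups URLs that become identical once a language marker is masked out and
reports the groups that contain both languages.

```python
import re
from collections import defaultdict

# Hypothetical language markers; real sites use many other conventions.
LANGS = ("pt", "en")
# A marker counts only when delimited by '/', '-', '_' or '.',
# as in /en/f.pdf or index-pt.html.
MARKER = re.compile(r"(?<=[/_.\-])(%s)(?=[/_.\-])" % "|".join(LANGS))

def candidate_pairs(urls):
    """Return (pt_url, en_url) tuples whose URLs differ only in the marker."""
    groups = defaultdict(dict)
    for url in urls:
        found = MARKER.findall(url)
        # Keep the sketch simple: skip URLs with zero or several markers.
        if len(found) != 1:
            continue
        key = MARKER.sub("{lang}", url)  # language-neutral form of the URL
        groups[key][found[0]] = url
    return sorted(
        (by_lang["pt"], by_lang["en"])
        for by_lang in groups.values()
        if "pt" in by_lang and "en" in by_lang
    )

urls = [
    "http://example.org/index-pt.html",
    "http://example.org/index-en.html",
    "http://example.org/en/f.pdf",
    "http://example.org/pt/f.pdf",
    "http://example.org/about.html",
    "http://example.org/pt/only.html",
]
pairs = candidate_pairs(urls)
print(len(pairs), "candidate pairs")
```

The count is only a number of candidates: it says nothing about whether the
two pages are real, complete translations of each other.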
J.João

Justin Washtell said,

> You say that you had to discard many of the pairs because of "bad" translations.
> Would it nonetheless be fair to say that, using your technique, the majority of
> the page pairs returned did represent dual language page pairs (i.e. attempted
> translations, however bad)? Or were there too many "red herrings"?
> 
> Kind regards,
> 
> Justin Washtell
> 
> 
> Quoting Jose Joao Dias de Almeida <jj at di.uminho.pt> on Mon 15 Oct 2007 05:09:08
> PM BST:
> 
>> Dear corpora-list friends,
>>
>> Santos Diana wrote:
>>> Yorick,
>>> ...
>>> I also seem to remember that some people in Braga (North of Portugal) did
>>> some experiments with grabbing parallel English-Portuguese pages from the
>>> Web, but since I know they are on the corpora-list they may answer directly
>>> :-)
>> Sorry to take so long to answer...
>>
>> We have done some work on extracting parallel corpora from the web, but
>> we cannot answer your questions:
>>      All the methods we used get only a (small) percentage of the
>> available bitexts (and we decided to reject many of them).
>>
>>     We programmed several strategies (parguess):
>>       1) starting from a huge URL list (for a large number of translations
>> one would choose a coherent file naming system: index-pt.html
>> index-en.html; /en/f.pdf /pt/f.pdf, etc.)
>>       2) starting from a set of files (looking for pairs that point to each
>> other with English/Portuguese links, or other topological approaches [see
>> Resnik's STRAND])
>>       3) starting from the result of a topological query [see Resnik's STRAND]
>>       4) several other file-type-specific strategies (subtitles, po, etc.)
>>
>>
>>     Using method 1), based on the analysis of huge URL lists, we got a very
>> big list of candidate pairs (many more than we could process), but
>> when we analyzed them we had to reject many because the translations
>> were too poor (many units missing, sometimes automatically translated
>> files, partial translations, etc.).
>>
>>
>>     We learned that in order to have good parallel corpora it is crucial
>> to choose the sources (sometimes use method 1 to find source candidates;
>> after that, choose the sources and analyze them as much as possible).
>>
>> See some details in:
>>   (http://alfarrabio.di.uminho.pt/~albie/publications/parguess.sepln.pdf)
>> and
>>
>> (http://alfarrabio.di.uminho.pt/~albie/publications/APL2k2.Parguess.pdf
>>    this one in Portuguese)
>>
>> Um abraço from Braga,
>> JJoao
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
