[Corpora-List] Query about the (dual) language of web pages

Kit Chun Yu ctckit at cityu.edu.hk
Thu Oct 11 09:57:43 UTC 2007


Dear Yorick,

An MPhil student of mine is currently working on automatic Web page 
pairing for bitext mining via automatic discovery of URL pairing patterns. 
It relies on no pre-defined patterns and no comparison of text content or 
structure, but only on a best-first search for an optimal set of 
patterns/keys within the URL strings from a Web site, ranked by their 
linking power (= the number of possible Web page pairs they can pair 
up).  That is, it works on URL strings only (plus language identification, 
of course). It is so simple that anyone can try it, I think. In fact, a 
conference reviewer commented that it is too simple for publication, 
although simplicity = beauty, in a sense. -:)  Our experiments show that 
it can achieve an F-score of 96.4% on Web page pairing for HK bilingual 
Web sites. I am not sure whether this simple technique could help a bit 
in estimating the figures you are interested in.
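
For anyone who wants to try the idea, here is a rough Python sketch. It is 
only an illustration under my own assumptions, not the algorithm from the 
paper: it assumes the URLs have already been split by language, it takes a 
candidate key to be the differing middle part of two URLs (both sides 
non-empty), and it replaces the best-first search with a plain greedy loop. 
All names (candidate_key, linked_pairs, discover_patterns, min_power) and 
the example URLs are hypothetical.

```python
from os.path import commonprefix


def candidate_key(u, v):
    """Return the differing middle parts (a, b) of two URLs, or None."""
    p = len(commonprefix([u, v]))
    ru, rv = u[p:], v[p:]
    s = len(commonprefix([ru[::-1], rv[::-1]]))  # common suffix length
    a, b = ru[:len(ru) - s], rv[:len(rv) - s]
    return (a, b) if a and b else None  # assumption: skip one-sided keys


def linked_pairs(key, urls_a, urls_b):
    """Pairs (u, v) where swapping a for b at some position in u gives v."""
    a, b = key
    pairs, used = [], set()
    for u in sorted(urls_a):
        i = u.find(a)
        while i != -1:
            v = u[:i] + b + u[i + len(a):]
            if v in urls_b and v not in used:
                pairs.append((u, v))
                used.add(v)
                break
            i = u.find(a, i + 1)
    return pairs


def discover_patterns(urls_a, urls_b, min_power=2):
    """Greedily pick keys by linking power (number of pairs they link)."""
    urls_a, urls_b = set(urls_a), set(urls_b)
    # Candidates come from all cross-language URL pairs.
    candidates = {candidate_key(u, v) for u in urls_a for v in urls_b}
    candidates.discard(None)
    chosen, pairs = [], []
    while candidates:
        scored = {k: linked_pairs(k, urls_a, urls_b) for k in candidates}
        best = max(scored, key=lambda k: (len(scored[k]), k))
        if len(scored[best]) < min_power:
            break  # remaining keys link too few pages to be trusted
        chosen.append((best, len(scored[best])))
        pairs.extend(scored[best])
        urls_a -= {u for u, _ in scored[best]}  # paired pages are used up
        urls_b -= {v for _, v in scored[best]}
        candidates.discard(best)
    return chosen, pairs


# Made-up example site; the language split is assumed to be given.
urls_en = ["http://x.hk/eng/about.html", "http://x.hk/eng/news.html",
           "http://x.hk/eng/contact.html"]
urls_zh = ["http://x.hk/chi/about.html", "http://x.hk/chi/news.html",
           "http://x.hk/chi/only_chinese.html"]
chosen, pairs = discover_patterns(urls_en, urls_zh)
print(chosen, pairs)
```

On this toy site the only key with linking power of at least 2 is 
("eng", "chi"), which pairs the "about" and "news" pages; the min_power 
threshold is my own guard against accidental keys that link a single pair.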

Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent to mine 
bilingual parallel pages via automatic discovery of URL pairing patterns 
<http://personal.cityu.edu.hk/%7Ectckit/papers/Kit-Ng_URLpairing-PID483174.pdf>. 
To appear in the Agents and Data Mining Interaction Workshop 
(ADMI-07), Silicon Valley, California, November 2-5, 2007. (But please 
mind a flaw in the formulation part: the search space should be UxU' 
(not U), similarly to TxT' for possible token pairs, which we fortunately 
got right -:) )

Best wishes,
Chunyu



Yorick Wilks wrote:

> Everyone is aware that some languages/cultures (e.g. Swedish, Finnish) 
> tend to have alternative webpages in English, while others (e.g. 
> Arabic) are much less likely to.
> Does anyone have any reliable figures as to the frequency of 
> appearance of these parallel corpora (in English) for different 
> (source) languages? I am interested at the moment in:
> Japanese, Chinese, Korean, Spanish, Portuguese, French, German, 
> Italian, Arabic
>
>  I would be grateful for any help.
> Regards
> Yorick Wilks
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
>  
>


-- 
Chunyu Kit, PhD
Assistant Professor in Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

E-mail:ctckit at cityu.edu.hk
http://personal.cityu.edu.hk/~ctckit/
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)136 5881 2972 (China Mobile)
