[Corpora-List] Query about the (dual) language of web pages

Kit Chun Yu ctckit at cityu.edu.hk
Mon Oct 15 02:29:44 UTC 2007


Dear Philip,

Thanks a lot for your comments and info. You are right that the density
of parallel pages would affect the performance of the approach. It is
one of the issues that we are very interested in exploring, and the URL
sets you provided will be very useful. I hope that its built-in
mechanism of favoring the more powerful patterns can deal with the
density problem to some extent. Currently we are trying to revive the
weak patterns filtered out by a threshold, by inferring more general
patterns from them. We are also looking into the possibility of
extending this approach to retrieve multilingual grouping (vs. bilingual
pairing) patterns from multilingual Web sites (e.g., many EU sites), to
examine how the number of languages (= another kind of density?) would
affect its performance.
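
For readers who have not seen the paper, here is a rough Python sketch of
the general idea of ranking URL pairing patterns by linking power and
dropping the weak ones with a threshold. It is only an illustration under
simplifying assumptions, not our actual implementation: the names
diff_keys, pair_urls and min_power are made up for this example, language
identification is assumed to be done already, every URL pair in UxU' is
compared exhaustively, and a pattern is taken to be the single differing
substring between two URLs.

```python
from collections import defaultdict


def diff_keys(url_a, url_b):
    """Return the (key_a, key_b) substrings that remain after stripping
    the common prefix and common suffix, or None for identical URLs."""
    if url_a == url_b:
        return None
    limit = min(len(url_a), len(url_b))
    i = 0
    while i < limit and url_a[i] == url_b[i]:
        i += 1
    j = 0
    while j < limit - i and url_a[-1 - j] == url_b[-1 - j]:
        j += 1
    return url_a[i:len(url_a) - j], url_b[i:len(url_b) - j]


def pair_urls(urls_a, urls_b, min_power=2):
    """Pair URLs of language A with URLs of language B, trying the
    patterns with the highest linking power first."""
    # Linking power of a pattern = number of URL pairs in UxU' it links.
    links = defaultdict(list)
    for a in urls_a:
        for b in urls_b:
            keys = diff_keys(a, b)
            if keys is not None:
                links[keys].append((a, b))

    ranked = sorted(links.items(), key=lambda kv: len(kv[1]), reverse=True)

    patterns, pairs, used = [], [], set()
    for keys, candidates in ranked:
        if len(candidates) < min_power:
            break  # weak patterns are filtered out by the threshold
        added = False
        for a, b in candidates:
            # Each page may be paired only once; stronger patterns win.
            if a not in used and b not in used:
                pairs.append((a, b))
                used.update((a, b))
                added = True
        if added:
            patterns.append((keys, len(candidates)))
    return patterns, pairs
```

On a toy site whose English pages live under /eng/ and whose Chinese
pages live under /chi/, the pattern ("eng", "chi") links every true pair,
while accidental patterns such as ("eng/a", "chi/b") link only one pair
each and fall below the threshold.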

Also I want to mention with gratitude that your previous work has given 
us so much input. Thanks.

Best wishes,
Chunyu


P Resnik wrote:

> Chunyu, your paper is very nice.  A key to the success of your 
> approach seems to be the fact that you experimented only on data from 
> Hong Kong government Web sites, where one could expect both a higher 
> density of parallel pages and a greater degree of conformity to URL 
> naming regularities, as compared to Web sites in general.  The 
> language independence of your student's approach is attractive and 
> should make it easy to investigate a broader range of language pairs 
> and types of Web sites, in order to see how much of a difference that 
> makes.  Is this a direction you are exploring, or which you plan to 
> explore?  (If so, the URL sets at 
> http://umiacs.umd.edu/~resnik/strand/ would make it easy to find 
> hosts already known to contain parallel pages for several language pairs.)
>
> Best regards,
>
>   Philip
>
>
> On 10/11/07, Kit Chun Yu <ctckit at cityu.edu.hk> wrote:
>
>     Dear Yorick,
>
>     An MPhil student of mine is currently working on automatic Web
>     page pairing for bitext mining via automatic discovery of URL
>     pairing patterns. It relies on no pre-defined patterns and no
>     comparison of text content or structure, but only on a best-first
>     search for an optimal set of patterns/keys within the URL strings
>     from a Web site, ranked by their linking power (= the number of
>     possible Web page pairs they can pair up).  That is, it works on
>     URL strings only (+ language identification, of course). It is so
>     simple that anyone can try it, I think. It is so simple that a
>     conference reviewer commented that it is too simple for
>     publication, although simplicity = beauty, in a sense. -:)  Our
>     experiments show that it can achieve an F-score of 96.4% on Web
>     page pairing for HK bilingual Web sites. I am not sure if this
>     simple technique could help a bit in estimating the figures you
>     are interested in.
>
>     Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent
>     to mine bilingual parallel pages via automatic discovery of URL
>     pairing patterns.
>     <http://personal.cityu.edu.hk/%7Ectckit/papers/Kit-Ng_URLpairing-PID483174.pdf>
>     To appear in the Agents and Data Mining Interaction Workshop
>     (ADMI-07), Silicon Valley, California, November 2-5, 2007. (But
>     please mind a flaw in the formulation part: the search space
>     should be UxU' (not U), similar to TxT' for possible token
>     pairs, which we fortunately got right -:)
>
>     Best wishes,
>     Chunyu
>
>
>
>     Yorick Wilks wrote:
>
>>     Everyone is aware that some languages/cultures (e.g. Swedish,
>>     Finnish) tend to have alternative webpages in English, while
>>     others (e.g. Arabic) are much less likely to.
>>     Does anyone have any reliable figures as to the frequency of
>>     appearance of these parallel corpora (in English) for different
>>     (source) languages? I am interested at the moment in:
>>     Japanese, Chinese, Korean, Spanish, Portuguese, French, German,
>>     Italian, Arabic
>>
>>      I would be grateful for any help.
>>     Regards
>>     Yorick Wilks
>>
>>
>>------------------------------------------------------------------------
>>
>>_______________________________________________
>>Corpora mailing list
>>Corpora at uib.no
>>http://mailman.uib.no/listinfo/corpora
>>  
>>
>
>
>
>
>


-- 
Chunyu Kit, PhD
Assistant Professor in Computational Linguistics

Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon

E-mail:ctckit at cityu.edu.hk
http://personal.cityu.edu.hk/~ctckit/
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
     (+86)136 5881 2972 (China Mobile)
