[Corpora-List] Query about the (dual) language of web pages

P Resnik psresnik at gmail.com
Fri Oct 12 14:43:02 UTC 2007


Chunyu, your paper is very nice.  A key to the success of your approach
seems to be the fact that you experimented only on data from Hong Kong
government Web sites, where one could expect both a higher density of
parallel pages and a greater degree of conformity to URL naming
regularities, as compared to Web sites in general.  The language
independence of your student's approach is attractive and should make it
easy to investigate a broader range of language pairs and types of Web
sites, in order to see how much of a difference that makes.  Is this a
direction you are exploring, or plan to explore?  (If so, the URL
sets at http://umiacs.umd.edu/~resnik/strand/ would make it easy to find
hosts already known to contain parallel pages for several language pairs.)

Best regards,

  Philip


On 10/11/07, Kit Chun Yu <ctckit at cityu.edu.hk> wrote:
>
>  Dear Yorick,
>
> An MPhil student of mine is currently working on automatic Web page
> pairing for bitext mining via automatic URL pairing pattern discovery. It
> relies on no pre-defined pattern or any text content/structure comparison
> but only on a best-first search for an optimal set of patterns/keys within
> the URL strings from a Web site in terms of their linking power (= the
> number of possible Web page pairs they can pair up).  That is, it works on
> URL strings only (+language identification, of course). It is so simple that
> anyone can try it, I think. Indeed, a conference reviewer commented that it
> is too simple for publication, although simplicity = beauty, in a sense. -:)
> Our experiments show that it can
> achieve an F-score of 96.4% on web page pairing for HK bilingual Web
> sites. I am not sure whether this simple technique could help a bit in
> estimating the figures you are interested in.
>
> Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent to mine
> bilingual parallel pages via automatic discovery of URL pairing patterns
> <http://personal.cityu.edu.hk/%7Ectckit/papers/Kit-Ng_URLpairing-PID483174.pdf>.
> To appear in the Agents and Data Mining Interaction Workshop (ADMI-07),
> Silicon Valley, California, November 2-5, 2007. (But please mind a flaw
> in the formulation part: the search space should be UxU' (not U), similarly
> to TxT' for possible token pairs, which we fortunately got right -:)
>
> Best wishes,
> Chunyu
>
>
>
> Yorick Wilks wrote:
>
> Everyone is aware that some languages/cultures (e.g. Swedish, Finnish)
> tend to have alternative webpages in English, while others (e.g. Arabic)
> are much less likely to. Does anyone have any reliable figures as to the
> frequency of appearance of these parallel corpora (in English) for
> different (source) languages? I am interested at the moment in:
> Japanese, Chinese, Korean, Spanish, Portuguese, French, German, Italian,
> Arabic
>
>   I would be grateful for any help.
> Regards
> Yorick Wilks
>
>
>  ------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>
> --
> Chunyu Kit, PhD
> Assistant Professor in Computational Linguistics
>
> Dept. of Chinese, Translation & Linguistics
> City University of Hong Kong
> 83 Tat Chee Ave., Kowloon
>
> E-mail: ctckit at cityu.edu.hk
> http://personal.cityu.edu.hk/~ctckit/
> Fax: (+852)2788 8706, 2788 8732
> Tel: (+852)2788 9310 (O), 9380 1738 (M)
>      (+86)136 5881 2972 (China Mobile)
>
>
>
>
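
The URL-only pairing idea described in the quoted message can be illustrated
with a rough sketch. This is not the authors' implementation: it assumes the
URLs of one site have already been split into two sets by language
identification, it derives candidate keys by stripping the common prefix and
suffix of each URL pair in UxU' (a simplification chosen here for brevity),
and it greedily keeps the key pair with the highest linking power. The names
diff_key, discover_pairs and min_power, and the example.org URLs, are
hypothetical.

```python
from os.path import commonprefix


def diff_key(u, v):
    """Return the substrings (k, k') left after removing the longest
    common prefix and suffix of two URL strings."""
    p = len(commonprefix([u, v]))
    ru, rv = u[p:], v[p:]
    s = len(commonprefix([ru[::-1], rv[::-1]]))
    return ru[:len(ru) - s], rv[:len(rv) - s]


def discover_pairs(urls_a, urls_b, min_power=2):
    """Greedy, best-first style pairing: repeatedly pick the key pair that
    links the most still-unpaired URLs, pair them, and remove them."""
    left, right = set(urls_a), set(urls_b)
    patterns, pairs = [], []
    while left and right:
        # Collect every candidate key over the remaining search space U x U'.
        candidates = {}
        for u in sorted(left):
            for v in sorted(right):
                if u != v:
                    candidates.setdefault(diff_key(u, v), []).append((u, v))
        if not candidates:
            break
        # Linking power = number of URL pairs a key can pair up.
        key, linked = max(candidates.items(),
                          key=lambda kv: (len(kv[1]), kv[0]))
        if len(linked) < min_power:
            break
        used = 0
        for u, v in linked:
            if u in left and v in right:  # each URL is paired at most once
                pairs.append((u, v))
                left.discard(u)
                right.discard(v)
                used += 1
        patterns.append((key, used))
    return patterns, pairs


# Hypothetical URLs from one bilingual site, already split by language.
urls_en = ["http://www.example.org/en/about.htm",
           "http://www.example.org/en/news.htm",
           "http://www.example.org/en/contact.htm"]
urls_zh = ["http://www.example.org/tc/about.htm",
           "http://www.example.org/tc/news.htm",
           "http://www.example.org/tc/faq.htm"]

patterns, pairs = discover_pairs(urls_en, urls_zh)
for key, power in patterns:
    print(key, power)
for u, v in pairs:
    print(u, "<->", v)
```

The min_power threshold is only there to stop the sketch from "pairing"
unrelated leftovers whose key links a single pair; the candidate scan is
quadratic in the number of URLs, which is acceptable for one site but would
need indexing for larger crawls.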