[Corpora-List] Query about the (dual) language of web pages
Kit Chun Yu
ctckit at cityu.edu.hk
Thu Oct 11 09:57:43 UTC 2007
Dear Yorick,
An MPhil student of mine is currently working on automatic Web page
pairing for bitext mining via automatic discovery of URL pairing patterns.
It relies on no pre-defined patterns and no comparison of text content or
structure, only on a best-first search for an optimal set of
patterns/keys within the URL strings from a Web site, ranked by their
linking power (= the number of possible Web page pairs they can pair
up). That is, it works on URL strings only (plus language identification,
of course). It is so simple that anyone may try it, I think. In fact, a
conference reviewer commented that it is too simple for publication,
although simplicity = beauty, in a sense. -:) Our experiments show that
it can achieve an F-score of 96.4% on Web page pairing for HK bilingual
Web sites. I am not sure whether this simple technique could help a bit
in estimating the figures you are interested in.
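
To make the idea of linking power concrete, here is a rough Python sketch.
It is not the implementation from the paper: it assumes that a pairing key
is a single URL token that differs between two otherwise identical URLs,
it replaces the best-first search with a plain greedy choice of the
strongest key, and it leaves out language identification altogether. The
function names, the delimiter set and the min_power threshold are all
made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(url):
    """Split a URL into tokens and delimiter runs (delimiters are kept)."""
    return re.split(r"([/._\-?=&]+)", url)


def candidate_key(u, v):
    """Return the differing token pair if u and v differ in exactly one token."""
    a, b = tokenize(u), tokenize(v)
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def linking_power(urls):
    """Count, for each candidate key pair, how many URL pairs it links."""
    power = Counter()
    linked = {}
    for u, v in combinations(sorted(urls), 2):  # quadratic in the number of URLs
        key = candidate_key(u, v)
        if key is None:
            continue
        if key[0] > key[1]:  # normalise direction so (k, k') and (k', k) merge
            key, u, v = (key[1], key[0]), v, u
        power[key] += 1
        linked.setdefault(key, []).append((u, v))
    return power, linked


def pair_pages(urls, min_power=2):
    """Greedily apply the key with the highest linking power until none qualifies."""
    remaining = set(urls)
    result = []
    while True:
        power, linked = linking_power(remaining)
        if not power:
            break
        key, score = max(power.items(), key=lambda kv: (kv[1], kv[0]))
        if score < min_power:
            break
        for u, v in linked[key]:
            if u in remaining and v in remaining:
                result.append((u, v))
                remaining -= {u, v}
    return result


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    site = [
        "http://example.hk/eng/about.html",
        "http://example.hk/chi/about.html",
        "http://example.hk/eng/news/1.html",
        "http://example.hk/chi/news/1.html",
        "http://example.hk/eng/contact.html",
    ]
    for pair in pair_pages(site):
        print(pair)
```

Without language identification, a key such as ("1", "2") can also pick up
some linking power on a bilingual site, so a real system would need to
check that the two pages of a pair are actually in different languages.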
Chunyu Kit and Jessica Y. H. Ng. 2007. An intelligent Web agent to mine
bilingual parallel pages via automatic discovery of URL pairing patterns
<http://personal.cityu.edu.hk/%7Ectckit/papers/Kit-Ng_URLpairing-PID483174.pdf>.
To appear in the Agents and Data Mining Interaction Workshop
(ADMI-07), Silicon Valley, California, November 2-5, 2007. But please
mind a flaw in the formulation part: the search space should be UxU'
(not U), similar to TxT' for possible token pairs, which we fortunately
got right. -:)
Best wishes,
Chunyu
Yorick Wilks wrote:
> Everyone is aware that some languages/cultures (e.g. Swedish, Finnish)
> tend to have alternative webpages in English, while others (e.g.
> Arabic) are much less likely to.
> Does anyone have any reliable figures as to the frequency of
> appearance of these parallel corpora (in English) for different
> (source) languages? I am interested at the moment in:
> Japanese, Chinese, Korean, Spanish, Portuguese, French, German,
> Italian, Arabic
>
> I would be grateful for any help.
> Regards
> Yorick Wilks
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora
>
>
--
Chunyu Kit, PhD
Assistant Professor in Computational Linguistics
Dept. of Chinese, Translation & Linguistics
City University of Hong Kong
83 Tat Chee Ave., Kowloon
E-mail: ctckit at cityu.edu.hk
http://personal.cityu.edu.hk/~ctckit/
Fax: (+852)2788 8706, 2788 8732
Tel: (+852)2788 9310 (O), 9380 1738 (M)
(+86)136 5881 2972 (China Mobile)