[Corpora-List] Query about the (dual) language of web pages
Santos Diana
Diana.Santos at sintef.no
Wed Oct 10 11:28:39 UTC 2007
Yorick,
for languages that span different countries (and therefore cultures) you may have to separate by country and not by language. For example, the situation may be different in Portugal and in Brazil, although in both countries we write Portuguese...
For the particular case of Portugal, the best source of information (I am copying the authors in this mail) is the paper:
Gomes, Daniel & Mário J. Silva. "Characterizing a National Community Web", ACM Transactions on Internet Technology 5, issue 3, pp. 508-531, August 2005. http://xldb.di.fc.ul.pt/data/Publications_attach/p508-gomes.pdf%20(1).pdf
I also seem to remember that some people in Braga (North of Portugal) did some experiments with grabbing parallel English-Portuguese pages from the Web, but since I know they are on the corpora-list they may answer directly :-)
As to Romance languages on the Web, there have also been some characterizations done by the União Latina that you may want to look into.
Back in 2002, we also did some work on estimating the amount of Portuguese on the Web (this time language, not country!), which included parallel content, but we never pursued it further:
Rachel Aires & Diana Santos. "Measuring the Web in Portuguese". In Brian Matthews, Bob Hopgood & Michael Wilson (eds.), Euroweb 2002 conference (Oxford, UK, 17-18 December 2002), pp. 198-199. Poster: http://www.linguateca.pt/Diana/download/posterAiresSantosEuroWeb2002.pdf
Abstract: http://www.linguateca.pt/Diana/download/AiresSantosEuroWeb2002.html
Needless to say, I am very interested in the results you get, and I hope you will summarize them to this list later on.
Best
Diana
________________________________
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Yorick Wilks
Sent: 9 October 2007 18:10
To: CORPORA
Subject: [Corpora-List] Query about the (dual) language of web pages
Everyone is aware that some languages/cultures (e.g. Swedish, Finnish) tend to have alternative webpages in English, while others (e.g. Arabic) are much less likely to.
Does anyone have any reliable figures as to the frequency of appearance of these parallel corpora (in English) for different (source) languages? I am interested at the moment in:
Japanese, Chinese, Korean, Spanish, Portuguese, French, German, Italian, Arabic
I would be grateful for any help.
Regards
Yorick Wilks