[Corpora-List] Query about the (dual) language of web pages

Santos Diana Diana.Santos at sintef.no
Wed Oct 10 11:28:39 UTC 2007


Yorick, 
for languages that span different countries (and therefore cultures) you may have to separate by country and not by language. For example, the situation may be different in Portugal and in Brazil, although in both countries we write Portuguese...
 
For the particular case of Portugal, the best source of information (I am copying the authors in this mail) is the paper:

Gomes, Daniel & Mário J. Silva. "Characterizing a National Community Web", ACM Transactions on Internet Technology 5, issue 3, pp. 508-531, August 2005. http://xldb.di.fc.ul.pt/data/Publications_attach/p508-gomes.pdf%20(1).pdf

I also seem to remember that some people in Braga (North of Portugal) did some experiments with grabbing parallel English-Portuguese pages from the Web, but since I know they are on the corpora-list they may answer directly :-)

As to Romance languages on the Web, there have also been some characterizations done by the União Latina that you may want to look into. 

Back in 2002, we also did some work on estimating the amount of Portuguese on the Web (this time the language, not the country!), which included parallel content, but we never pursued it further:

Rachel Aires & Diana Santos. "Measuring the Web in Portuguese". In Brian Matthews, Bob Hopgood & Michael Wilson (eds.), Euroweb 2002 conference (Oxford, UK, 17-18 December 2002), pp. 198-199. Poster: http://www.linguateca.pt/Diana/download/posterAiresSantosEuroWeb2002.pdf
Abstract: http://www.linguateca.pt/Diana/download/AiresSantosEuroWeb2002.html

Needless to say, I am very interested in the results you get, and I hope you will summarize them for this list later on.

Best
Diana 


________________________________

	From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Yorick Wilks
	Sent: 9 October 2007 18:10
	To: CORPORA
	Subject: [Corpora-List] Query about the (dual) language of web pages
	
	
	Everyone is aware that some languages/cultures (e.g. Swedish, Finnish) tend to have alternative webpages in English, while others (e.g. Arabic) are much less likely to. 
	Does anyone have any reliable figures as to the frequency of appearance of these parallel corpora (in English) for different (source) languages? I am interested at the moment in:
	Japanese, Chinese, Korean, Spanish, Portuguese, French, German, Italian, Arabic
	
	
	I would be grateful for any help.
	Regards
	Yorick Wilks

	
	




