FW: [Corpora-List] Clouds on the "banlieues"

D.G.Damle D.G.Damle at open.ac.uk
Thu Nov 10 14:16:24 UTC 2005


Hi Jean,

I looked at this and found it interesting.  Are you doing anything to look at the relationship between the terms or between the sources where the terms occur?

I am doing something similar.  I use the Google API to collect up to 300 URLs, which I then use as a corpus to extract terms using various statistical techniques.  I am using the British National Corpus as a model of English in general, to identify terms whose statistical behaviour in the corpus differs from the reference.  I am experimenting with processing these sources to create a knowledge model involving the terms and their interrelationships - that is the focus of my research.  The Google query etc. is just a way of collecting a (somewhat dirty) corpus.
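
For the comparison against the BNC, a log-likelihood (G2) keyness score is one standard option.  The Python sketch below is purely illustrative - the counts are invented and it is not my actual pipeline - but it shows the basic idea: rank terms by how far their frequency in the web-derived corpus departs from the reference corpus.

import math
from collections import Counter

def log_likelihood(freq_focus, total_focus, freq_ref, total_ref):
    # Expected counts under the null hypothesis that the term is (relatively)
    # equally frequent in both corpora.
    expected_focus = total_focus * (freq_focus + freq_ref) / (total_focus + total_ref)
    expected_ref = total_ref * (freq_focus + freq_ref) / (total_focus + total_ref)
    g2 = 0.0
    if freq_focus > 0:
        g2 += freq_focus * math.log(freq_focus / expected_focus)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * g2

def keywords(focus_counts, ref_counts, n=50):
    # Rank terms of the focus (web) corpus by how strongly their frequency
    # deviates from the reference corpus (here: the BNC).
    total_focus = sum(focus_counts.values())
    total_ref = sum(ref_counts.values())
    scored = [(term, log_likelihood(freq, total_focus, ref_counts.get(term, 0), total_ref))
              for term, freq in focus_counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

# Toy counts, entirely made up, just to show the interface.
web = Counter({"banlieue": 120, "riot": 80, "the": 5000})
bnc = Counter({"banlieue": 2, "riot": 40, "the": 600000})
print(keywords(web, bnc, n=3))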

Dileep Damle
KMi
Open University, Milton Keynes, UK
http://kmi.open.ac.uk/people/dileep/

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Jean Veronis
Sent: 10 November 2005 12:25
To: corpus list
Subject: [Corpora-List] Clouds on the "banlieues"


Hi all,

I'm sure you've heard of the French "banlieues".

You may be interested in this study. It's in French, apologies, but I am 
sure you can get the general idea. http://aixtal.blogspot.com/2005/11/blogs-banlieues-dans-les-nuages.html

Steps of the processing:

1. Get the URLs of blog posts speaking of the riots, using the keyword "banlieues" (Technorati API)
2. Get the full text of the posts (a minimal sketch of this step follows the list)
3. Extract terms (thanks to Didier Bourigault's Syntex program)
4. Display as a "cloud" and link to contexts
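
For step 2, something along these lines is usually enough to get raw text out of the posts.  This is only a minimal sketch, not the code behind the study: it assumes more or less UTF-8 pages and does no boilerplate removal at all.

import urllib.request
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    # Very crude HTML-to-text: keep character data, drop tags, scripts and styles.
    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def fetch_text(url):
    # Download one post and return its visible text as a single string.
    with urllib.request.urlopen(url, timeout=30) as response:
        raw = response.read()
    parser = TextOnly()
    parser.feed(raw.decode("utf-8", errors="replace"))
    return " ".join(" ".join(parser.chunks).split())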

Quick and easy (less than an hour of work). Could be fully automated 
with very little effort.
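
For the cloud in step 4, the display itself is simple: scale the font size with the log of the term frequency and link each term to a page showing its contexts.  A minimal HTML-generating sketch (the context_url_for callback and the counts are placeholders, not part of the actual system):

import html
import math

def make_cloud(term_freqs, context_url_for, min_pt=10, max_pt=36):
    # Render term frequencies as an HTML "cloud": font size grows with
    # log frequency, and each term links to its contexts.
    lo = math.log(min(term_freqs.values()))
    hi = math.log(max(term_freqs.values()))
    links = []
    for term in sorted(term_freqs):
        scale = 0.0 if hi == lo else (math.log(term_freqs[term]) - lo) / (hi - lo)
        size = min_pt + scale * (max_pt - min_pt)
        links.append('<a style="font-size:%.0fpt" href="%s">%s</a>'
                     % (size, html.escape(context_url_for(term)), html.escape(term)))
    return "<p>" + " ".join(links) + "</p>"

# Toy usage; in practice the frequencies come from the extracted terms.
freqs = {"banlieue": 412, "violence": 187, "police": 230, "voiture": 95}
print(make_cloud(freqs, lambda t: "contexts.html#" + t))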

I'd be happy to know if other Corporists are working on similar systems.

-- 
Jean Véronis
  Web:  http://www.up.univ-mrs.fr/veronis
  Blog: http://aixtal.blogspot.com

 


