[Corpora-List] Web pages corpus

Chris Jordan cjordan at cs.dal.ca
Mon Mar 6 14:49:10 UTC 2006


Hello,

I am also interested in a standard web corpus. The reason I do not want
to compile my own is that it would then be difficult to compare the
results of my experiments with other reported results. I am also
hoping for an annotated corpus that includes other valuable information
such as genre, topic, and abstract, which has to be added by
assessors / subject matter experts.

Imen, depending on your project / experiment, I would carefully consider
what you are attempting to show. Creating a corpus is an option; however,
it may make your experiment a one-off and lower the value of your
results. Furthermore, since you are doing document summarization, if you
use your own corpus, you will be limited to performing a user evaluation
to assess its capabilities. Generally, these types of evaluations are
beyond the scope of a Master's.

Chris

Jakob Halskov wrote:

>Dear Imen,
>
>It is very easy to compile a web corpus on your own using one of the freely available web search APIs. See for example:
>
>http://developer.yahoo.net/search/index.html
>
>or
>
>http://www.google.com/apis/
>
>Best regards,
>
>Jakob Halskov
>--
>PhD student
>Dept. of Computational Linguistics
>Copenhagen Business School
>www.id.cbs.dk
>
>----- Original Message -----
>From: "ismi.touati" <ismi.touati at laposte.net>
>Date: Monday, March 6, 2006 12:29 pm
>Subject: [Corpora-List] Web pages corpus
>
>>Dear all,
>>
>>I'm working on automatic summarization of web pages, and I'm looking
>>for a corpus of web pages (HTML documents) with their abstracts to
>>evaluate my system.
>>
>>Does anyone know if such a corpus exists?
>>
>>Thanks in advance for the help.
>>Imen.
>>
>>***********************************
>>Imen Touati
>>Master's student at the Faculty of Economic Science and Management of
>>Sfax, Tunisia.
>>LARIS laboratory
>>Address: LARIS, FSEGS, BP 1088, 3018 Sfax, Tunisia


