[Corpora-List] Corpus from Blogs required.

Trilok Khairnar trilokgk at gmail.com
Mon Apr 4 08:45:44 UTC 2005


Hello Jean-Phi, Gilad 

Thanks for the inputs.

Permalinks and Technorati APIs will definitely be useful.

The Technorati APIs provide the inbound and outbound links of a blog, basic
user and blog info, etc., but not the list of posts on a blog or their
text.

On the other hand, permalinks should be useful for extracting the text of
one blog post at a time, though surrounding text on the page, such as
badges and the blogroll, will be included too. (It looks like a hack will
be required to extract only the text of a post when a permalink is
available.)
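
To make the kind of hack I have in mind concrete, here is a rough sketch
in Python using only the standard library. It assumes the blog template
wraps each post in an element with a known class; the class name
"post-body" and the names PostTextExtractor and extract_post_text are
made up for illustration, and the real class will differ from template to
template. It also assumes reasonably well-formed HTML (tags that are
opened are also closed), which real blog pages often are not.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PostTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, post_class):
        super().__init__()
        self.post_class = post_class  # hypothetical, template-specific name
        self.depth = 0                # 0 means we are outside the post element
        self.done = False             # set once the post element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            # Already inside the post: track nesting of child elements.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.post_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the post element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_post_text(html, post_class="post-body"):
    """Return the post text, without sidebar, badges or blogroll."""
    parser = PostTextExtractor(post_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_post_text(page_html) on a permalink page would then return
only the text found inside the post element. A template that marks the
post with an id instead of a class would need a small change to the
attribute check.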

I will try this sometime using the Atom.Net and RSS.Net libraries and let
the list members know.
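
For anyone who wants to experiment before then, the feed step can be
sketched in a few lines of Python with the standard library (this is not
the Atom.Net or RSS.Net API, just an illustration of the idea). The
function name list_posts is made up. The sketch assumes an RSS 2.0 feed
or an Atom feed in the http://www.w3.org/2005/Atom namespace; feeds in
other versions or namespaces are simply skipped, and the OPML step is
left out.

```python
import xml.etree.ElementTree as ET

# Assumed Atom namespace; older Atom drafts use a different one.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def list_posts(feed_xml):
    """Return a list of {title, link, text} dicts for the posts in a feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    if root.tag == "rss":
        for item in root.iter("item"):
            posts.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                # The description may be a summary rather than the full post.
                "text": item.findtext("description", default=""),
            })
    elif root.tag == ATOM_NS + "feed":
        for entry in root.iter(ATOM_NS + "entry"):
            link = ""
            for el in entry.findall(ATOM_NS + "link"):
                # A link without a rel attribute counts as "alternate".
                if el.get("rel", "alternate") == "alternate":
                    link = el.get("href", "")
                    break
            text = (entry.findtext(ATOM_NS + "content", default="")
                    or entry.findtext(ATOM_NS + "summary", default=""))
            posts.append({
                "title": entry.findtext(ATOM_NS + "title", default=""),
                "link": link,
                "text": text,
            })
    return posts
```

The "link" values are the permalinks, so where the feed carries only a
summary, the full text would still have to be fetched from the permalink
page.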

Thanks,
Trilok.

On Mar 31, 2005 4:05 AM, Jean-Phi <jpprost at gmail.com> wrote:
> Hi,
> 
> > In the absence of such corpus and APIs, I am thinking of doing this by
> > 1] using RSS, ATOM feed parsers on some OPML files to get URLs for blog posts
> > 2] Extracting the text (easier if the blog template format is known)
> 
> It might not be that easy: I suspect that many blogs use some sort of
> Content Management System, which basically means that the texts are
> stored in a database, and are only presented in the blog dynamically,
> on request. In such cases my guess is that you'll probably need to know
> a minimum about the database structure in order to query it --unless,
> of course, the site provides you with an RSS feed. Or do I miss
> something?
> 
> Some blog hosting sites may also couple the dynamic rendering
> with a permanent HTML link for each text. http://www.blogger.com/ (now
> owned by Google) does provide both these features: RSS feed and
> permanent link. I don't hold any shares, though...
> 
> Cheers,
> --
>  Jean-Philippe Prost
>    Centre for Language Technology
>    Macquarie University ~ Sydney, Australia
> and
>    Laboratoire Parole et Langage (Speech & Language Lab.)
>    Université de Provence ~ Aix-en-Provence, France
> <http://www.ics.mq.edu.au/~jpprost/>