[Corpora-List] Language Specific Feed Discovery and Feed Aggregators

Siva Reddy siva at sivareddy.in
Thu Jan 12 14:37:20 UTC 2012


We are working on collecting language-specific feed URLs of blogs, news
sites, social networks, and any other frequently updated websites. Our aim
is to provide temporal corpora, similar to Spinn3r but with a focus on
deduplicated and cleaned corpora for NLP/lexicographic applications.
Any pointers on the questions below would be greatly appreciated.

1. How can language-specific feed URLs (RSS/Atom) be discovered? Is crawling
the web the only solution? Is it possible to piggyback on search engines
(preferably Bing)?

2. Are there any large open-source repositories of language-specific feed
URLs?

3. What are the best practices for implementing a feed aggregator? Are there
any known statistics on the percentage of pingback-enabled blogs in the
blogosphere?

4. Are there any existing feed aggregators that can handle millions of
feeds (intelligently)?

5. Are there existing affordable licensed or open-source tools that
implement any of the above steps?

We would also like to know more about similar projects by other groups (and
possibly collaborate).

Thanks very much,

Siva

=================================================
Siva Reddy                                  http://www.sivareddy.in
Lexical Computing Ltd.                  http://www.sketchengine.co.uk
University of York                         http://www.cs.york.ac.uk
=================================================