<div>We are working on collecting language-specific feed URLs of blogs, news sites, social networks, and other frequently updated websites. Our aim is to provide temporal corpora, similar to Spinn3r, but with a focus on deduplicated and cleaned corpora for NLP/lexicographic applications. Any pointers on the questions below would be greatly appreciated.</div>
<div><br></div><div>1. How can we discover language-specific feed URLs (RSS/Atom)? Is crawling the web the only solution, or is it possible to piggyback on search engines (preferably Bing)?</div><div><br></div><div>2. Are there any large open-source repositories of language-specific feed URLs?</div>
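<div>To make question 1 concrete, here is a minimal sketch of the kind of feed autodiscovery we have in mind: scanning a page's &lt;link rel="alternate"&gt; tags for RSS/Atom MIME types, using only the Python standard library (the class and function names are ours, just for illustration):</div>

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate"> tags."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)  # attribute names arrive lowercased
        if (a.get("rel", "").lower() == "alternate"
                and a.get("type", "").lower() in self.FEED_TYPES
                and "href" in a):
            self.feeds.append(a["href"])

def discover_feeds(html):
    """Return feed URLs (possibly relative) advertised in an HTML page."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds

sample = '''<html><head>
<link rel="alternate" type="application/rss+xml" href="/feed.rss">
<link rel="alternate" type="application/atom+xml" href="/feed.atom">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>'''
print(discover_feeds(sample))  # ['/feed.rss', '/feed.atom']
```

<div>This only covers sites that advertise their feeds in the page head; feeds at conventional paths (/feed, /rss) or linked only in page bodies would still need crawling heuristics.</div>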
<div><br></div><div>3. What are the best practices for implementing a feed aggregator? Are there any known statistics on the percentage of blogs in the blogosphere that support pingbacks?</div><div><br></div><div>4. Are there any existing feed aggregators which can intelligently handle millions of feeds?</div>
<div><br></div><div>5. Are there any affordable licensed tools or open-source tools implementing any of the above steps?</div><div><br></div><div>We would also like to hear about similar projects by other groups (and possibly collaborate).</div>
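<div>Regarding the best practices asked about in question 3, one convention we assume any large-scale aggregator would follow is HTTP conditional GET (If-None-Match / If-Modified-Since, per RFC 7232), so that an unchanged feed costs only a 304 response rather than a full download. A rough stdlib-only sketch (function names are ours):</div>

```python
import urllib.request
import urllib.error

def conditional_headers(etag=None, last_modified=None):
    """Build conditional-GET headers from the values saved on the last poll."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def poll_feed(url, etag=None, last_modified=None):
    """Fetch a feed; returns (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which case
    the previous validators are carried forward for the next poll.
    """
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:  # feed unchanged since last poll
            return None, etag, last_modified
        raise
```

<div>At millions of feeds, one would presumably also schedule polls adaptively (frequently updated feeds polled more often) and subscribe to push notifications where available, rather than polling everything at a fixed interval.</div>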
<div><br></div><div>Thanks very much,</div><div><br></div><div>Siva</div><div><br></div><div><div>=================================================</div>Siva Reddy <a href="http://www.sivareddy.in/" target="_blank">http://www.sivareddy.in</a><div>
<div>Lexical Computing Ltd. <a href="http://www.sketchengine.co.uk/" target="_blank">http://www.sketchengine.co.uk</a></div><div>University of York <a href="http://www.cs.york.ac.uk/" target="_blank">http://www.cs.york.ac.uk</a></div>
<div>=================================================</div></div></div><div><br></div>