[Corpora-List] How to download text from the web to build a corpus ?

Albretch Mueller lbrtchx at gmail.com
Tue Jul 24 19:33:09 UTC 2012


 Hi Imene,
~
 Something that I think hasn't been pointed out to you is that you can
simply crawl Arabic-language websites and take the textual data
directly from them.
~
 Say you want the content on this page from Al Jazeera's site:
~
 http://www.aljazeera.net/news/pages/d2aec553-33c8-4bc9-8fb5-6be9742d1bca
~
 without all the junk (JavaScript, navigation bar, ads, ...). That page
contains the link to the textual version:
~
 http://www.aljazeera.net/news/pages/d2aec553-33c8-4bc9-8fb5-6be9742d1bca#
~
 All that needs to be done is:
~
 0) download just the page (no images, or only the ones the content refers to)
 1) make sure that the HTML is well-formed
 2) scrape the page using XSLT
 3) while doing 2), parse out the link to the textual version
 4) you may want to ensure that the character set of the page is UTF-8
 5) save the file with the textual version under a path/file name
that somehow resembles the URL
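~
 A rough sketch of steps 2), 3) and 5) in Python, using only the
standard library. Note that it swaps XSLT for the tolerant
html.parser module, so it does not need step 1). The names
PageScraper, extract, url_to_path and the "corpus" root directory are
made up for illustration, and it collects every link on the page,
since how to recognise the link to the textual version depends on the
site's markup:

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlsplit


class PageScraper(HTMLParser):
    """Collect visible text and link targets, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []   # pieces of visible text, in document order
        self.links = []    # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html):
    """Return (text, links) for an HTML string."""
    scraper = PageScraper()
    scraper.feed(html)
    scraper.close()
    return "\n".join(scraper.chunks), scraper.links


def url_to_path(url, root="corpus"):
    """Map a URL to a relative file path that mirrors host and URL path."""
    parts = urlsplit(url)
    # Drop empty and dot segments so the result stays below the root.
    segments = [s for s in parts.path.split("/") if s and s not in (".", "..")]
    if not segments:
        segments = ["index"]
    return str(PurePosixPath(root, parts.netloc, *segments)) + ".txt"
```

 With the Al Jazeera URL above, url_to_path would place the text under
corpus/www.aljazeera.net/news/pages/ in a file named after the page id.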
~
 Anyone who knows a bit of coding (Java, Python, ...) could do this
(Heck! As you can see for yourself, learning to do this is not a big
deal ;-)). Since Java has become my L1 lately, I would use:
~
 0) HttpClient
 1) HtmlCleaner or JTidy
 2, 3, 4, 5) are trivial using Java
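~
 For those who prefer Python, steps 0) and 4) might look roughly like
this with the standard library instead of the Java libraries named
above. The function names, the User-Agent string and the choice of
windows-1256 (a common legacy Arabic encoding) as fallback are my
assumptions, not something any particular site requires:

```python
import urllib.request


def decode_body(raw, declared=None, fallbacks=("utf-8", "windows-1256")):
    """Decode raw bytes, trying the declared charset first, then fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


def fetch_as_utf8(url, timeout=30):
    """Download only the page itself and return its content as UTF-8 bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "corpus-builder-sketch"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if it is not given.
        declared = response.headers.get_content_charset()
    return decode_body(raw, declared).encode("utf-8")
```

 Only the HTML is fetched, so images and other embedded resources are
never downloaded. A charset declared only in a meta tag inside the page
is not looked at here.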
~
 You can maintain your local copy of Arabic-language pages by following
the sites' RSS feeds.
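~
 As an illustration of the RSS idea, here is a small sketch that pulls
the article links out of an RSS 2.0 feed and keeps only the ones not
yet in the local copy. It assumes a plain RSS 2.0 layout (item elements
with a link child); feed_links, new_links and the "seen" set are
hypothetical names:

```python
import xml.etree.ElementTree as ET


def feed_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def new_links(rss_xml, seen):
    """Return feed links that are not in the set of already saved URLs."""
    return [url for url in feed_links(rss_xml) if url not in seen]
```

 Each new link can then go through the same download and scrape steps
as any other page.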
~
 lbrtchx
