[Corpora-List] tool for extracting text from web forum and websites

Stefan Th. Gries stgries at gmail.com
Thu Oct 15 22:39:04 UTC 2009


You can use R (<http://www.r-project.org/>) to download web pages and
strip their HTML tags easily. To load the contents of
<http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html>
with the tags removed, enter this at the console:

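# read the page line by line, then delete everything between < and >;
# the outer parentheses make R print the result as well as store it in x
(x <- gsub("<[^>]*?>", "",
   scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
      what=character(0), sep="\n", quote="", comment.char=""), perl=TRUE))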

or this (to write the result to a file called <res.txt>):

# same as above, but without printing the result to the console
x <- gsub("<[^>]*?>", "",
   scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
      what=character(0), sep="\n", quote="", comment.char=""), perl=TRUE)
cat(x, file="res.txt", sep="\n")
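
One caveat: because the page is read line by line, a tag that starts on
one line and ends on the next is not removed by the code above. A minimal
sketch of one way around that is to join the lines before applying the
regular expression. The function name strip_tags, the sample HTML, and the
temporary output file are assumptions made up for illustration, and this
is still a regex-based cleanup, not a full HTML parser:

```r
# Hypothetical helper (not part of the code above): remove HTML tags from a
# character vector of lines, including tags that span a line break.
strip_tags <- function(lines) {
  page <- paste(lines, collapse = "\n")            # one string, so multi-line tags can match
  page <- gsub("<[^>]*>", "", page, perl = TRUE)   # [^>] also matches newlines
  text <- unlist(strsplit(page, "\n", fixed = TRUE))
  text <- trimws(text)                             # drop leading/trailing whitespace
  text[nzchar(text)]                               # drop lines that are now empty
}

# Made-up sample input; the <p> tag is split across two lines on purpose.
html <- c("<html><body>",
          "<p class=\"a\"",
          "   id=\"b\">Hello <b>world</b></p>",
          "<p>Second line</p>",
          "</body></html>")

clean <- strip_tags(html)
print(clean)

# Write the cleaned lines to a temporary file (stand-in for <res.txt>).
out <- tempfile(fileext = ".txt")
cat(clean, file = out, sep = "\n")
```

To use it on a real page, pass the output of scan(...) from above as the
lines argument instead of the sample vector.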

Cf. <http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>
for a more detailed application.

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


