[Corpora-List] tool for extracting text from web forum and websites
Stefan Th. Gries
stgries at gmail.com
Thu Oct 15 22:39:04 UTC 2009
You can use R (<http://www.r-project.org/>) to download files and
clean them easily. To load the contents of
<http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html>
with the HTML tags removed, enter this at the console:
(x <- gsub("<[^>]*?>", "",
   scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
   what=character(0), sep="\n", quote="", comment.char=""), perl=TRUE))
or this (to write the result to a file called <res.txt>):
x <- gsub("<[^>]*?>", "",
   scan("http://www.linguistics.ucsb.edu/faculty/stgries/research/overview-research.html",
   what=character(0), sep="\n", quote="", comment.char=""), perl=TRUE)
cat(x, file="res.txt", sep="\n")
Cf. <http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>
for a more detailed application.
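
One caveat with the line-by-line approach: scan() splits the page at
line breaks, so a tag that is broken across two lines is not matched
by the regular expression, and the contents of <script> and <style>
elements are left in the output. A minimal sketch of one way around
this, assuming a hypothetical helper called strip_html() (the name
and the small test page are made up for illustration, they are not
part of the commands above):

```r
# Hypothetical helper: strip tags from a character vector of HTML lines.
strip_html <- function(lines) {
  # Join the lines so that tags broken across line ends can be matched.
  page <- paste(lines, collapse = "\n")
  # Remove <script> and <style> elements together with their contents.
  page <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", "", page, perl = TRUE)
  # Remove all remaining tags; [^>] also matches line breaks.
  page <- gsub("<[^>]*>", "", page, perl = TRUE)
  # Split back into lines, trim whitespace, and drop empty lines.
  out <- trimws(unlist(strsplit(page, "\n", fixed = TRUE)))
  out[nzchar(out)]
}

# Made-up test page with a tag that spans two lines.
html <- c("<html><head><style>p {color: red}</style></head>",
          "<body><p class=\"a\"",
          "   id=\"b\">Hello, <b>world</b>!</p>",
          "",
          "<p>Second line</p></body></html>")
res <- strip_html(html)
print(res)
```

For a real page you could pass the result of readLines() on the URL
to strip_html() and write the output with cat(..., file="res.txt",
sep="\n") as above. Character entities such as &amp; are not decoded
by this sketch.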
HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora