[Corpora-List] string frequency reports for Project Gutenberg texts
Ronald Reck
rreck at iama.rrecktek.com
Mon Jul 8 12:50:32 UTC 2002
Hello all,
I have created string frequency
reports for 5400+ books (400M words)
from Project Gutenberg:
http://iama.rrecktek.com/text/frequency/
They are searchable here:
http://iama.rrecktek.com/cgi-bin/apps/wordfind/searchpg.pl
The process is described briefly here, with links to
all the source in CVS:
http://iama.rrecktek.com/text/
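
For readers who want a feel for what a string frequency report contains, here is a minimal sketch in Python. It is not the actual source (that is in CVS at the link above). The tokenization rule and the names string_frequencies and format_report are assumptions made for illustration only.

```python
import re
from collections import Counter


def string_frequencies(text):
    """Count case-folded, word-like strings in a text."""
    # Assumption: a "string" is a run of ASCII letters, optionally with
    # internal apostrophes (so "didn't" stays one token).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


def format_report(counts, top=10):
    """Return a tab-separated 'count<TAB>string' report, most frequent first."""
    lines = []
    for word, n in counts.most_common(top):
        lines.append(f"{n}\t{word}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "The cat sat on the mat. The mat didn't move."
    print(format_report(string_frequencies(sample)))
```

Running one such pass per book and saving the output would give one report per text; the real reports may use a different notion of "string".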
I am looking for help improving
these string frequency histograms across the archive
when they are rendered in SVG:
http://iama.rrecktek.com/text/frequency/words/seeall.html
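
As one possible starting point for labeled histograms, here is a small Python sketch that writes a bar chart as SVG with a text label under each bar. It is not how the graphs above are produced; the function name histogram_svg, the layout parameters, and the colors are all assumptions.

```python
from xml.sax.saxutils import escape


def histogram_svg(counts, bar_width=40, max_height=100, gap=10):
    """Return an SVG string with one labeled bar per (label, count) pair."""
    if not counts:
        return '<svg xmlns="http://www.w3.org/2000/svg" width="0" height="0"></svg>'
    peak = max(n for _, n in counts)
    width = len(counts) * (bar_width + gap) + gap
    height = max_height + 30  # leave room below the bars for labels
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for i, (label, n) in enumerate(counts):
        # Scale each bar relative to the tallest one.
        h = round(max_height * n / peak) if peak else 0
        x = gap + i * (bar_width + gap)
        y = max_height - h  # SVG y grows downward
        parts.append(
            f'<rect x="{x}" y="{y}" width="{bar_width}" height="{h}" fill="steelblue"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2}" y="{max_height + 15}" '
            f'text-anchor="middle" font-size="10">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(histogram_svg([("the", 3), ("mat", 2), ("cat", 1)]))
```

Because the labels are ordinary text elements, they stay readable at any zoom level, which is one way the labeling problem in bitmap graphs could be avoided.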
I merged some of the results into an SVG
(it's worth the plugin hassle):
http://iama.rrecktek.com/~rreck/samplesvg
I also extended the DAML ontology for PG presented here:
http://www.daml.org/ontologies/113
and created RDF metadata for the archive here:
http://iama.rrecktek.com/text/frequency/meta/
The metadata is loaded into a specialized RDF backend called
Parka. This example query shows how to get RF values for an
author's use of certain strings:
http://iama.rrecktek.com/cgi-bin/apps/parka/parka.pl
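
To illustrate the kind of number such a query might return, here is a sketch in plain Python that does not use Parka or RDF at all. It assumes that "RF" means relative frequency (occurrences of a string divided by the total number of tokens in the author's books); that reading, and the function name relative_frequencies, are assumptions.

```python
from collections import Counter


def relative_frequencies(book_counts, strings):
    """Relative frequency of each string across one author's books.

    book_counts: iterable of Counter objects, one per book.
    strings: the strings to look up.
    """
    total = Counter()
    for counts in book_counts:
        total.update(counts)  # sum the per-book counts
    n_tokens = sum(total.values())
    if n_tokens == 0:
        return {s: 0.0 for s in strings}
    # Counter returns 0 for strings the author never used.
    return {s: total[s] / n_tokens for s in strings}


if __name__ == "__main__":
    books = [Counter({"the": 3, "whale": 1}), Counter({"the": 1, "sea": 3})]
    print(relative_frequencies(books, ["the", "whale", "ship"]))
```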
Comments and criticisms are much appreciated.
(I know the PNG graphs aren't labeled well; that will all get fixed
in the SVGs.)
----
Ronald P. Reck rreck at iama.rrecktek.com