[Corpora-List] string frequency reports for Project Gutenberg texts

Ronald Reck rreck at iama.rrecktek.com
Mon Jul 8 12:50:32 UTC 2002


Hello all,

I have created string frequency
reports for 5400+ books (400M words)
from Project Gutenberg:
http://iama.rrecktek.com/text/frequency/
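
For anyone who wants to reproduce a report on a single text, here is a minimal sketch of the idea in Python. It is not the code in CVS: the function name `frequency_report` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for illustration only.

```python
import re
from collections import Counter

def frequency_report(text):
    """Return (string, count) pairs, most frequent first."""
    # Assumption: a "string" is a lowercased run of letters or apostrophes.
    # The tokenization used for the real reports may differ.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

sample = "The whale, the whale! A hump like a snow-hill."
for word, count in frequency_report(sample)[:3]:
    print(count, word)
```

Running the same function over each book and summing the counters would give archive-wide totals.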

They are searchable here:
http://iama.rrecktek.com/cgi-bin/apps/wordfind/searchpg.pl

The process is described briefly here, with links to
all the source in CVS:
http://iama.rrecktek.com/text/

I am looking for help improving
these string frequency histograms for the archive
when they are rendered in SVG:
http://iama.rrecktek.com/text/frequency/words/seeall.html
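
As a starting point for discussion, here is a rough sketch of how a labeled histogram could be written out as SVG from (string, count) pairs. The function name `histogram_svg`, the bar dimensions, and the layout are all hypothetical; this is not how the existing graphs are generated.

```python
from xml.sax.saxutils import escape

def histogram_svg(counts, bar_width=40, height=120):
    """Render (label, count) pairs as a labeled SVG bar chart string."""
    # Scale bars against the largest count; fall back to 1 if all are zero.
    peak = max(count for _, count in counts) or 1
    parts = []
    for i, (label, count) in enumerate(counts):
        bar = round(height * count / peak)
        x = i * bar_width
        # SVG's y axis points down, so a bar starts at (height - bar).
        parts.append(
            f'<rect x="{x}" y="{height - bar}" '
            f'width="{bar_width - 4}" height="{bar}"/>'
        )
        # Put the string itself under its bar so every bar is labeled.
        parts.append(
            f'<text x="{x}" y="{height + 14}" font-size="10">'
            f'{escape(label)}</text>'
        )
    width = bar_width * len(counts)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height + 20}">'
        + "".join(parts)
        + "</svg>"
    )

print(histogram_svg([("the", 10), ("whale", 5)]))
```

Because the labels are ordinary `<text>` elements, they stay legible when the graph is zoomed, which is one reason to prefer SVG here.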

I merged some of the results into an SVG:
(it's worth the plugin hassle)
http://iama.rrecktek.com/~rreck/samplesvg

I also extended the DAML ontology for PG presented here:
http://www.daml.org/ontologies/113

and created RDF metadata for the archive here:
http://iama.rrecktek.com/text/frequency/meta/

The metadata is loaded into a specialty RDF backend called
Parka. This example query shows how to get RF values for an
author's use of certain strings:
http://iama.rrecktek.com/cgi-bin/apps/parka/parka.pl
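
To make the kind of value concrete, here is a small sketch that assumes "RF" means relative frequency, that is, a string's count divided by the total count for that author. It does not use Parka or its query language; the function name `relative_frequency` and the sample counts are made up for illustration.

```python
def relative_frequency(counts, targets):
    """Map each target string to count / total over all counted strings."""
    # Assumption: RF = occurrences of the string / all string occurrences.
    total = sum(counts.values())
    return {
        target: (counts.get(target, 0) / total if total else 0.0)
        for target in targets
    }

# Hypothetical per-author counts, not taken from the archive.
author_counts = {"whale": 3, "sea": 1}
print(relative_frequency(author_counts, ["whale", "ship"]))
```

Strings an author never uses come back as 0.0 rather than raising an error, which makes comparisons across authors easier.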

Comments and criticisms are very much appreciated.
(I know the PNG graphs aren't labeled well; all of that will get fixed
in the SVGs.)


----
Ronald P. Reck                          rreck at iama.rrecktek.com


