[Corpora-List] textual data cleansing and management APIs ...

Albretch Mueller lbrtchx at gmail.com
Tue Jan 6 18:24:36 UTC 2015


 AFAIK there are very few (if any) open source APIs or agreed-upon
protocols which help with data management, cleansing, classification
and coordination. Most people doing corpora research do their own work
in a cookbook style and advertise their corpora once they are encoded and
(ready to be) annotated.

 Most text banks (e.g. archive.org/details/texts, gutenberg.org)
contain tons of texts, but their content isn't ready for corpora
research. I (lbrtchx) have tried to convince the maintainers of text banks
to cleanse and maintain their data in a way that is friendlier to
coordination and collaboration:

 http://www.pgdp.net/phpBB2/viewtopic.php?t=46221

 http://www.pgdp.net/phpBB2/viewtopic.php?t=45708

 https://archive.org/post/1023377/heritrix-data-only-renderings-and-consolidation-remote-to-local-address-mappings

 https://archive.org/post/277736/any-proofreading-of-the-texts-you-include-in-your-collections

 but, since they seem to have the single-reader use case in mind, they
don't seem to understand, let alone care. (What on earth is that thing
about gutenberg.org paginating the lines of their texts (and even DNA
data) as if people were reading them on mainframe terminals?) I think we
(corpora researchers) will have to take care of that business on our
own.
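
 To give an idea of the kind of preprocessing I mean, here is a minimal
Python sketch that undoes fixed-width line wrapping in a plain-text etext.
It assumes that the "pagination" above amounts to hard line breaks inside
paragraphs and that paragraphs are separated by blank lines; the function
name unwrap_paragraphs is made up for illustration and is not part of any
existing API. It would mangle verse, tables and words hyphenated across a
line break, so take it as a starting point only.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so that each paragraph sits on one line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalise Windows line endings first.
    text = text.replace("\r\n", "\n")
    # Split on blank (or whitespace-only) lines to get the paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for paragraph in paragraphs:
        # Drop the padding around each physical line, then rejoin with spaces.
        lines = [ln.strip() for ln in paragraph.split("\n") if ln.strip()]
        unwrapped.append(" ".join(lines))
    # Keep one blank line between paragraphs.
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "It was the best\nof times, it was\nthe worst of times.\n\nNext paragraph."
    print(unwrap_paragraphs(sample))
```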

 I heard (I think it was Wikipedia's) Jimmy Wales talking about
providing people with snapshots of the (entire?) web, but anyone
attempting to build corpora from available etexts should expect an
arduous preprocessing phase and ongoing maintenance. I don't think
there is even a registry/clearinghouse of etexts out there.

 Do you know of such studies/efforts?

 lbrtchx
 corpora at uib.no: textual data cleansing and management APIs ...
