[Corpora-List] Computer books: parallel corpus?

Darren Cook darren at dcook.org
Tue Dec 3 01:37:25 UTC 2013


>> (And if so, what about the next step of splitting up the PDFs, and then
>> matching up chapters, paragraphs, or even sentences?)
> 
> My small bit: in my experience, extracting text from PDFs well is a
> non-trivial task.

Same experience here, and different tools have worked better on
different PDFs.
In this case it may be possible to get the original data from the
authors for some of the books.

Darren
