[Corpora-List] Computer books: parallel corpus?
Darren Cook
darren at dcook.org
Tue Dec 3 01:37:25 UTC 2013
>> (And if so, what about the next step of splitting up the PDFs, and then
>> matching up chapters, paragraphs, or even sentences?)
>
> My small bit: in my experience, extracting text from PDFs well is a
> non-trivial task.
Same experience here; and different tools have worked better on
different PDFs.
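
One way to act on that is to run several extractors over each PDF and keep whichever output looks cleanest. The following is only a rough sketch under stated assumptions: the extraction step itself is not shown, the tool names in the usage note are placeholders, and `quality_score` and `pick_best_extraction` are hypothetical helper names. The heuristic assumes whitespace-separated text, so it would need rethinking for languages such as Japanese.

```python
import string


def quality_score(text: str) -> float:
    """Crude heuristic: fraction of whitespace-separated tokens that look like words.

    Assumption: badly extracted text tends to contain stray single letters,
    digits fused into words, and symbol debris, all of which lower the score.
    """
    tokens = text.split()
    if not tokens:
        return 0.0

    def looks_like_word(token: str) -> bool:
        # Ignore leading/trailing punctuation such as quotes, commas, full stops.
        core = token.strip(string.punctuation)
        # Single letters are suspicious unless they are the words "a" or "I".
        return core.isalpha() and (len(core) > 1 or core.lower() in {"a", "i"})

    return sum(looks_like_word(t) for t in tokens) / len(tokens)


def pick_best_extraction(candidates: dict[str, str]) -> str:
    """Return the name of the tool whose output scores highest.

    `candidates` maps a tool name (any label you like) to the text that
    tool extracted from one PDF.
    """
    return max(candidates, key=lambda name: quality_score(candidates[name]))
```

Usage would be along the lines of `pick_best_extraction({"tool_a": text_a, "tool_b": text_b})`, called once per PDF, so that each book can end up with a different tool. A score like this only catches gross failures (letter-spaced text, symbol soup); it says nothing about reading order or lost paragraph breaks, which matter for the alignment step.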
In this case it may be possible to get original data from the authors,
for some of the books.
Darren