[Corpora-List] Computer books: parallel corpus?
Alexander Yeh
asy at mitre.org
Tue Dec 3 01:19:50 UTC 2013
Darren Cook wrote:
> I just discovered Free Programming Books:
>
> https://github.com/vhf/free-programming-books/blob/master/free-programming-books.md#professional-development
>
> and then in 13 other languages:
> https://github.com/vhf/free-programming-books#in-other-speaking-languages
>
> For the other languages it looks like quite a few are translations of
> English titles. Has anyone already done the work to match them up?
>
> (And if so, what about the next step of splitting up the PDFs, and then
> matching up chapters, paragraphs, or even sentences?)
My small bit: in my experience, extracting text from PDFs and doing it
well is a non-trivial task.
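To give a flavour of the cleanup involved, here is a rough Python sketch. It assumes the text has already been pulled out of the PDF by some extraction tool, and it only handles three common problems: page-number lines, words hyphenated across line breaks, and hard-wrapped lines. The function name clean_pdf_text and the heuristics are illustrative assumptions, not a tested pipeline.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Tidy text already extracted from a PDF (hypothetical helper)."""
    # Drop lines that contain nothing but a page number.
    lines = [ln for ln in raw.splitlines()
             if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Rejoin words split across a line break: "extrac-\ntion" -> "extraction".
    # Assumption: every end-of-line hyphen is a soft hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks; fold other newlines into spaces.
    paragraphs = [re.sub(r"\s+", " ", p).strip()
                  for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = "Text extrac-\ntion from PDFs\nis hard.\n\n12\n\nNext para-\ngraph."
    print(clean_pdf_text(sample))
```

Even this small sketch shows the difficulty: the hyphen rule also turns a genuine compound such as "non-" / "trivial" split at a line end into "nontrivial", and multi-column layouts, footnotes, and code listings are not handled at all.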
>
> Darren
>
> P.S. There is quite a lot of other potential research on this body of
> books, even just sticking to one language. E.g. automatic glossary
> extraction. E.g. comparison of metrics of sentence length or complexity
> or word variety with other genres of books. (Do computer book authors
> simplify their language, because the topic is already complicated
> enough?) (How does language in computer books differ from language in
> computer science papers?)
> E.g. can we take two or three books on the same topic and automatically
> identify differences in them (which might point to content errors)?
>
>
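On the metrics mentioned in the P.S.: once clean text is available, sentence length and word variety are easy to approximate. The sketch below is a minimal illustration in Python, assuming a naive sentence splitter (break after ".", "!" or "?") and a simple alphabetic tokenizer. The function name text_metrics is hypothetical, and type-token ratio is used here as one possible stand-in for "word variety".

```python
import re


def text_metrics(text: str) -> dict:
    """Crude sentence-length and word-variety figures (illustrative only)."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased alphabetic tokens; digits and punctuation are ignored.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "sentences": len(sentences),
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    print(text_metrics("The cat sat. The cat ran away!"))
```

Type-token ratio falls as a text gets longer, so comparisons between genres would need samples of equal size. The naive splitter will also stumble on abbreviations and on code snippets, both of which are frequent in computer books.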