[Corpora-List] Computer books: parallel corpus?

Darren Cook darren at dcook.org
Tue Dec 3 00:40:35 UTC 2013


I just discovered Free Programming Books:

https://github.com/vhf/free-programming-books/blob/master/free-programming-books.md#professional-development

and then in 13 other languages:
  https://github.com/vhf/free-programming-books#in-other-speaking-languages

For the other languages it looks like quite a few are translations of
English titles. Has anyone already done the work to match them up?

(And if so, what about the next step of splitting up the PDFs, and then
matching up chapters, paragraphs, or even sentences?)
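
To make the matching step concrete, here is a minimal sketch of aligning paragraphs from two versions of a book using nothing but their character lengths. It is a simplified length-based heuristic, not an established alignment algorithm and not anything the book list itself provides. The function name `align_by_length`, the `skip_penalty` value, and the cost formula are all assumptions chosen for illustration.

```python
def align_by_length(src, tgt, skip_penalty=1.0):
    """Align two lists of paragraphs using only their character lengths.

    Returns a list of (i, j) index pairs; i or j is None when a paragraph
    on one side has no counterpart. Hypothetical helper for illustration.
    """
    n, m = len(src), len(tgt)
    # Rescale target lengths so both texts have the same total length
    # (translations are often uniformly longer or shorter).
    total_src = sum(len(s) for s in src) or 1
    total_tgt = sum(len(t) for t in tgt) or 1
    ratio = total_src / total_tgt

    def match_cost(s, t):
        a, b = len(s), len(t) * ratio
        # 0.0 for equal (rescaled) lengths, approaching 1.0 for a big mismatch.
        return abs(a - b) / max(a, b, 1)

    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Dynamic programming over prefixes: match one-to-one, or skip a
    # paragraph on either side at a fixed penalty.
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i and j:
                steps.append((cost[i - 1][j - 1]
                              + match_cost(src[i - 1], tgt[j - 1]),
                              (i - 1, j - 1)))
            if i:
                steps.append((cost[i - 1][j] + skip_penalty, (i - 1, j)))
            if j:
                steps.append((cost[i][j - 1] + skip_penalty, (i, j - 1)))
            if steps:
                cost[i][j], back[i][j] = min(steps)

    # Walk the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Length alone is a weak signal, so a sketch like this would probably only be a first pass; chapter headings, code listings, and figures that survive translation unchanged would make much better anchors.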

Darren

P.S. There is quite a lot of other potential research on this body of
books, even just sticking to one language. E.g. automatic glossary
extraction. E.g. comparing sentence length, complexity, or word variety
against other genres of books. (Do computer book authors
simplify their language, because the topic is already complicated
enough?) (How does language in computer books differ from language in
computer science papers?)
E.g. can we take two or three books on the same topic and automatically
identify differences in them (which might point to content errors)?
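
As a rough illustration of the sentence-length and word-variety metrics mentioned above, here is a minimal sketch that works on plain text already extracted from a book. The function name `text_metrics`, the naive sentence splitter, and the ASCII-only word pattern are assumptions for the example; a real study would want a proper tokenizer.

```python
import re


def text_metrics(text):
    """Return simple sentence-length and word-variety metrics for a string."""
    # Naive split: ., ! or ? followed by whitespace ends a sentence.
    # (Assumption: this will mis-split on abbreviations and code snippets.)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens: runs of ASCII letters, optionally with an apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not sentences or not words:
        return {"sentences": 0, "words": 0,
                "mean_sentence_length": 0.0, "type_token_ratio": 0.0}
    return {
        "sentences": len(sentences),
        "words": len(words),
        # Average number of word tokens per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Distinct words divided by total words (a crude word-variety measure).
        "type_token_ratio": len(set(words)) / len(words),
    }


m = text_metrics("The cat sat. The cat ran fast!")
print(m["mean_sentence_length"])  # 3.5
```

Type-token ratio falls as a text gets longer, so comparing genres this way would only be fair on samples of the same size.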


-- 
Darren Cook, Software Researcher/Developer

http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)



