[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Martin Reynaert reynaert at uvt.nl
Fri Oct 12 10:53:17 UTC 2012


Hi Ramesh,

We have written about this, on the basis of our experiences in building 
a new reference corpus for Dutch, in:

Title: "Beyond SoNaR: towards the facilitation of large corpus
building efforts"

http://www.lrec-conf.org/proceedings/lrec2012/summaries/748.html

The e) issue you raise is not minor when building a large corpus...
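
To illustrate why e) takes real work: below is a minimal Python sketch of
one common heuristic, not the procedure from the paper above. It assumes
the PDF has already been converted to one text string per page, and it
drops bare page numbers plus any line that recurs at the top or bottom of
most pages. The function name strip_running_lines and the parameters edge
and min_share are made up for this example.

```python
import re
from collections import Counter

# A line holding only a number, optionally preceded by the word "page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def _normalise(line):
    # Mask digits so "Chapter 3 | 17" and "Chapter 3 | 18" compare equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, edge=2, min_share=0.6):
    """Drop page numbers and lines repeated at the edges of most pages.

    pages     -- list of strings, one per page (assumed input format)
    edge      -- how many non-empty lines at top and bottom to inspect
    min_share -- share of pages a line must recur on to count as a header/footer
    """
    split = [page.splitlines() for page in pages]

    # Count, per page, which normalised lines occur near the page edges.
    counts = Counter()
    for lines in split:
        non_empty = [l for l in lines if l.strip()]
        counts.update({_normalise(l) for l in non_empty[:edge] + non_empty[-edge:]})

    threshold = max(2, min_share * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        idx = [i for i, l in enumerate(lines) if l.strip()]
        edge_idx = set(idx[:edge] + idx[-edge:])
        kept = [
            l for i, l in enumerate(lines)
            # Only edge lines are ever removed; body text is left alone.
            if not (i in edge_idx and (PAGE_NUMBER.match(l) or _normalise(l) in repeated))
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

Even this simple version has failure modes, e.g. a numeric table cell that
happens to be the last line of a page is removed as if it were a page
number, so the output still needs spot checks on a large corpus.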

Greetings,

Martin

On 10/12/2012 12:28 PM, Krishnamurthy, Ramesh wrote:
>
>
>
> Hi Mark
>
> Several people have asked recently about the easiest way to convert PDF files to plain text
>
> (including Rama Meganathan on this list). I know there are various problems:
>
> a) graphic PDFs rather than text PDFs, e.g. when people have scanned older texts
>
> that were not created/available as digitized text
>
> b) columnar layout
>
> c) embedded graphics, e.g. photos, diagrams, graphs
>
> d) software that can only process one page at a time, or outputs one file per page
>
> e) minor irritations, such as page numbers and headers/footers that need to be edited out
>
>
>
> What is currently the easiest method/software to convert PDF files to plain text files?
>
>
>
> best
>
> Ramesh
>
> -------------------------
>
> Date: Thu, 11 Oct 2012 15:37:54 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: Re: [Corpora-List] corpus of textbooks
> To: MAT T <terrettgnome at hotmail.com>, "corpora at uib.no"
> <corpora at uib.no>
>
> Lots of free textbooks (legally!) at: http://www.ck12.org/ . Just download the PDF's and convert to text.
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

