[Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Fri Oct 12 15:13:13 UTC 2012

Many thanks, Min! :)

________________________________________
From: Min-Yen Kan [knmnyn at gmail.com]
Sent: 12 October 2012 13:27
To: Andrew Gilbert
Cc: Krishnamurthy, Ramesh; corpora at uib.no
Subject: Re: [Corpora-List] corpus of textbooks; "Just download the PDF's and convert to text"

Hi all:

We had a separate discussion about this problem a while ago in our
digital anthology mailing list, largely concerning the ACL Anthology
proceedings collection.

http://wing.comp.nus.edu.sg/pipermail/danth/2006-August/000002.html

However, some new software might be suitable for your work. You can
take a look at PDFX and PDFExtract.

http://pdfx.cs.man.ac.uk/

http://moin.delph-in.net/AclAnthologyCorpus

You might also check out the work from the R50 workshop, which is
related to the latter link.

http://translit.i2r.a-star.edu.sg/r50/

Cheers,

Min

--
Min-Yen KAN (Dr) :: Associate Professor :: National University of
Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive
Singapore 117417 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
kanmy at comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)

On Fri, Oct 12, 2012 at 7:30 PM, Andrew Gilbert <andy at agilbert.net> wrote:
> poppler is an OSS package with some nice weapons for this
>
> pdftotext will convert to plain text
>
> Perhaps more helpful for retaining some of the column and layout information, you can also use pdftohtml to convert to XML with positional data, for example:
>
> pdftohtml -xml input.pdf output.xml
>
> <text top="79" left="652" width="171" height="12" font="2"><b>LOCATION: NIKKEN BUILDING</b></text>
> <text top="92" left="121" width="129" height="12" font="4">Woodland Hills, CA 91367</text>
> <text top="91" left="652" width="140" height="12" font="2"><b>                    52 Discovery</b></text>
>
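
As a rough illustration of how the positional attributes in that XML could be used, here is a minimal Python sketch that sorts the <text> elements into two columns by their "left" value and orders each column by "top". The <page> wrapper around the sample, the function name split_columns, and the split_x boundary of 400 are all assumptions made for this example; a real document would need its own boundary, and probably something smarter than a single fixed threshold.

```python
import xml.etree.ElementTree as ET

# Sample <text> elements copied from the message above, wrapped in a
# <page> element so the fragment parses (the wrapper is a simplification).
SAMPLE = """<page number="1">
<text top="79" left="652" width="171" height="12" font="2"><b>LOCATION: NIKKEN BUILDING</b></text>
<text top="92" left="121" width="129" height="12" font="4">Woodland Hills, CA 91367</text>
<text top="91" left="652" width="140" height="12" font="2"><b>                    52 Discovery</b></text>
</page>"""


def split_columns(page_xml, split_x=400):
    """Return (left_lines, right_lines), each in top-to-bottom order.

    split_x is a hypothetical column boundary in the same units as the
    "left" attribute; it has to be tuned per document.
    """
    page = ET.fromstring(page_xml)
    left, right = [], []
    for el in page.iter("text"):
        # itertext() also collects text nested inside <b>, <i>, etc.
        content = "".join(el.itertext()).strip()
        if not content:
            continue
        item = (int(el.get("top")), content)
        if int(el.get("left")) < split_x:
            left.append(item)
        else:
            right.append(item)
    # Sorting the (top, text) pairs orders each column from top to bottom.
    return [t for _, t in sorted(left)], [t for _, t in sorted(right)]


left_col, right_col = split_columns(SAMPLE)
print(left_col)
print(right_col)
```

Reading each column top to bottom, instead of taking lines in file order, is what keeps text from neighbouring columns from being interleaved.
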
>
> Andrew Gilbert
> andy at agilbert.net
> (m) 802-535-1653
> (h) 802-426-2108
>
> On Oct 12, 2012, at 6:28 AM, "Krishnamurthy, Ramesh" <r.krishnamurthy at aston.ac.uk> wrote:
>
>>
>> Hi Mark
>>
>> Several people have asked recently about the easiest way to convert PDF files to plain text
>> (including Rama Meganathan on this list). I know there are various problems:
>>
>> a) graphic PDFs rather than text PDFs - eg when people have scanned older texts
>>    that were not created/available as digitized text?
>> b) columnar layout
>> c) embedded graphics, eg photos, diagrams, graphs
>> d) software that can only process one page at a time, or outputs one file per page
>> e) minor irritations, such as page numbers and headers/footers that need to be edited out
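
Item (e) in the list above can be partly automated. The following is only a sketch, assuming the text came from pdftotext, which by default separates pages with a form feed character. It trims bare page numbers from the top and bottom of each page, then drops a first or last line that recurs on most pages. The function names and the min_share threshold of 0.6 are hypothetical choices for this example, and multi-line headers or footers would need more work.

```python
from collections import Counter


def _trim(lines, is_junk):
    """Remove leading and trailing lines for which is_junk(line) is true."""
    start, end = 0, len(lines)
    while start < end and is_junk(lines[start]):
        start += 1
    while end > start and is_junk(lines[end - 1]):
        end -= 1
    return lines[start:end]


def strip_page_furniture(text, min_share=0.6):
    """Drop page numbers and running headers/footers at page edges.

    Assumes pages are separated by form feed characters. min_share is a
    hypothetical threshold: a first or last line counts as a running
    header/footer if it occurs on at least that share of pages.
    """
    def blank_or_number(line):
        s = line.strip()
        return s == "" or s.isdigit()

    # First pass: trim blank lines and bare page numbers from page edges.
    pages = [_trim(raw.split("\n"), blank_or_number) for raw in text.split("\f")]
    pages = [p for p in pages if p]

    # Count how many pages start or end with each line (once per page).
    edge_counts = Counter()
    for lines in pages:
        edge_counts.update({lines[0].strip(), lines[-1].strip()})
    repeated = {
        line for line, n in edge_counts.items()
        if len(pages) > 1 and n / len(pages) >= min_share
    }

    def junk(line):
        return blank_or_number(line) or line.strip() in repeated

    # Second pass: also trim the repeated edge lines.
    cleaned = [_trim(lines, junk) for lines in pages]
    return "\n\n".join("\n".join(lines) for lines in cleaned if lines)


sample = "My Textbook\nFirst page body.\n1\n\fMy Textbook\nSecond page body.\n2\n\f"
result = strip_page_furniture(sample)
print(result)
```

Only the edges of each page are touched, so a number or a repeated phrase in the middle of the body text is left alone.
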
>>
>> What is currently the easiest method/software to convert PDF files to plain text files?
>>
>> best
>> Ramesh
>>
>> -------------------------
>>
>> Date: Thu, 11 Oct 2012 15:37:54 +0000
>> From: Mark Davies <Mark_Davies at byu.edu>
>> Subject: Re: [Corpora-List] corpus of textbooks
>> To: MAT T <terrettgnome at hotmail.com>, "corpora at uib.no"
>> <corpora at uib.no>
>>
>> Lots of free textbooks (legally!) at: http://www.ck12.org/ . Just download the PDF's and convert to text.
>>
>> Mark Davies
>>
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora