<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:"Trebuchet MS";
panose-1:2 11 6 3 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
span.EmailStyle19
{mso-style-type:personal;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Trebuchet MS","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>John, <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>A tool we found useful in the compilation of MICUSP (<a
href="http://micusp.elicorpora.info/">http://micusp.elicorpora.info/</a>) is
PDF to Word (<a href="http://www.pdftoword.com/">http://www.pdftoword.com/</a>),
a free online tool that converts PDFs into DOC or RTF files. We used it for
files that Adobe Reader couldn’t convert, for example because they were
password-protected. You then still have to open the output files in Word
and save them as plain text from there. For our purposes, going via Word rather
than straight to TXT was the preferred option: you get an editable
version of the text that stays quite close to the original (including figures,
tables, etc.), which makes it easier to insert gap tags or line breaks in the
right places. <o:p></o:p></span></p>
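<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Once the files have been checked and tagged in Word, the final
save-as-text step could in principle be scripted. The following is only a
minimal sketch, assuming the edited files were re-saved from Word as .docx (it
does not read the DOC or RTF output directly). The function name docx_to_text
is hypothetical and not part of the MICUSP workflow. It uses only the Python
standard library and ignores tabs, manual line breaks, footnotes and headers.<o:p></o:p></span></p>
<pre>
```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line.

    Assumption: the file is a regular .docx archive whose main text lives
    in word/document.xml. Text boxes nested inside a paragraph may show
    up twice, so the output should be spot-checked.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of all runs.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```
</pre>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Calling docx_to_text("paper01.docx"), with a made-up file name,
would return the paragraphs joined by newlines, ready to be written to a .txt
file.<o:p></o:p></span></p>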
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Best of luck with the project!<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Ute <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<div>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>*********************************************************<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Just launched: MICUSP Simple -- free search/browse interface to
the Michigan Corpus of Upper-level Student Papers (829 papers, around 2.6 million
words):<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><a href="http://search-micusp.elicorpora.info/simple/"><span
style='color:blue'>http://search-micusp.elicorpora.info/simple/</span></a> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Dr. Ute Römer<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Director of the Applied Corpus Linguistics Unit<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>English Language Institute<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>University of Michigan<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Email: <a href="mailto:uroemer@umich.edu"><span
style='color:blue'>uroemer@umich.edu</span></a> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Fax: +1 734 763 0369 <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><a href="http://www.elicorpora.info"><span style='color:blue'>http://www.elicorpora.info</span></a>
<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><a href="http://www.uteroemer.com"><span style='color:blue'>http://www.uteroemer.com</span></a>
<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'> <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Surface mail address: <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Dr. Ute Römer <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>University of Michigan <o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>English Language Institute <o:p></o:p></span></p>
<p class=MsoNormal><span lang=DE style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>500 E. Washington Street <o:p></o:p></span></p>
<p class=MsoNormal><span lang=DE style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>Ann Arbor, MI 48104-2028 <o:p></o:p></span></p>
<p class=MsoNormal><span lang=DE style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'>USA <o:p></o:p></span></p>
<p class=MsoNormal><span lang=DE style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'> <o:p></o:p></span></p>
</div>
<p class=MsoNormal><span lang=DE style='font-family:"Trebuchet MS","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<div>
<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span
style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>
corpora-bounces@uib.no [mailto:corpora-bounces@uib.no] <b>On Behalf Of </b>John
MCKENNY<br>
<b>Sent:</b> Wednesday, June 16, 2010 6:40 AM<br>
<b>To:</b> corpora@uib.no<br>
<b>Subject:</b> [Corpora-List] converting PDFs to ASCII or text-only files
without clumps<o:p></o:p></span></p>
</div>
</div>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span lang=EN-GB>Does anyone have a solution to the
problem we are facing in a corpus-linguistic research project? We have been
given permission by the publishers and editors to download all issues of
a journal from the last 30 years, which are available from our university
e-library as PDFs and amount to about 3,000,000 words. Starting with a small
sample (250,000 words), we tried various methods and software,
including Wordsmith Tools 5, to convert the PDFs into text-only files. The
result so far has been text-only files with many words clumped together,
e.g. ‘inthefinalanalysisitseems’. Breaking up these clumps is
a time-consuming business. For this reason, we haven’t started compiling
our larger corpus. We would only build the larger corpus if there were some kind
of automated or semi-automated way to generate text-only files that contain
all and only the alphanumeric sequences bounded by spaces in the original PDFs,
in other words, without clumps.<o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-GB>We would be very grateful for any
suggestions you might have.<o:p></o:p></span></p>
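<p class=MsoNormal><span lang=EN-GB>One semi-automated way to break up such
clumps is dictionary-based segmentation. The sketch below is only an
illustration under stated assumptions: the function split_clump and the tiny
SAMPLE_VOCABULARY are hypothetical, a real run would need a full word list
for the language of the journal, and ambiguous clumps would still need manual
checking. It uses dynamic programming to find the split with the fewest known
words.<o:p></o:p></span></p>
<pre>
```python
# Hypothetical toy vocabulary; a real project would load a large word list.
SAMPLE_VOCABULARY = {
    "a", "an", "in", "is", "it", "its",
    "the", "final", "analysis", "seems",
}


def split_clump(clump, vocabulary, max_word_len=20):
    """Split a run-together string into known words.

    Returns the segmentation with the fewest words, or None if the
    string cannot be covered completely by words from the vocabulary.
    """
    text = clump.lower()
    n = len(text)
    # best[i] is the shortest word list covering text[:i], or None if
    # that prefix cannot be segmented with the given vocabulary.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if best[start] is None or word not in vocabulary:
                continue
            candidate = best[start] + [word]
            if best[end] is None:
                best[end] = candidate
            else:
                best[end] = min(best[end], candidate, key=len)
    return best[n]


# Prints: in the final analysis it seems
print(" ".join(split_clump("inthefinalanalysisitseems", SAMPLE_VOCABULARY)))
```
</pre>
<p class=MsoNormal><span lang=EN-GB>The function returns None when a clump
cannot be fully covered by the vocabulary, so those cases can be flagged for
manual correction.<o:p></o:p></span></p>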
<div>
<p class=MsoNormal><span lang=EN-GB>Best wishes<o:p></o:p></span></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><span lang=EN-GB
style='font-size:10.0pt;font-family:"Arial","sans-serif"'>John McKenny<br>
Deputy Head of the Division of English Studies<br>
University of Nottingham Ningbo, China</span><span lang=EN-GB style='font-size:
10.0pt;font-family:"Verdana","sans-serif"'><br>
</span><span lang=EN-GB style='font-size:10.0pt;font-family:"Arial","sans-serif"'>199
Taikang Dong Lu<br>
Ningbo, Zhejiang Province<br>
P.R.China 315100<o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-GB style='font-size:10.0pt;font-family:"Arial","sans-serif"'><a
href="mailto:john.mckenny@nottingham.edu.cn">john.mckenny@nottingham.edu.cn</a><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p>
</div>
<p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p>
<p class=MsoNormal><span lang=EN-GB><o:p> </o:p></span></p>
</div>
</body>
</html>