<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0cm;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-GB link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal>Does anyone have a solution to a problem we are
facing in a corpus linguistics research project? The publishers and editors
have given us permission to download all issues of a journal from the last
30 years, which are available from our university e-library as PDFs and
amount to about 3,000,000 words. Starting with a small sample (250,000
words), we tried various methods and software, including WordSmith
Tools 5, to convert the PDFs into text-only files. So far the resulting
text-only files contain many words clumped together, e.g. ‘inthefinalanalysisitseems’.
Breaking up these clumps is time-consuming, so we haven’t
started compiling our larger corpus. We would only build the larger corpus if
there were some automated or semi-automated way to generate text-only
files containing all and only the alphanumeric sequences bounded by spaces
in the original PDFs, in other words, without clumps.<o:p></o:p></p>
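<p class=MsoNormal>To make the task concrete, the kind of semi-automated repair
we have in mind might look something like the Python sketch below. It is only an
illustration and has not been run on our data: the function name split_clump and
the eight-word vocabulary are hypothetical, and real use would presumably need a
full English word list and a better way of choosing between competing splits. The
sketch splits a clump into the fewest words found in the list and returns None
when no complete split exists.<o:p></o:p></p>
<pre>
```python
def split_clump(clump, vocabulary):
    """Split a run of letters into known words, preferring the fewest words.

    Returns a list of words, or None if the clump cannot be fully covered
    by words from `vocabulary` (a hypothetical word list, all lowercase).
    """
    text = clump.lower()
    longest = max(map(len, vocabulary), default=0)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest word in the vocabulary.
        for start in range(max(0, end - longest), end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(best[end]) > len(candidate):
                    best[end] = candidate
    return best[len(text)]


# Toy vocabulary for illustration only; an assumption, not a real lexicon.
vocabulary = {"a", "analysis", "final", "in", "is", "it", "seems", "the"}
words = split_clump("inthefinalanalysisitseems", vocabulary)
print(" ".join(words))  # prints "in the final analysis it seems"
```
</pre>
<p class=MsoNormal>Clumps for which the function returns None could be set aside
for manual checking, which is what we mean by semi-automated.<o:p></o:p></p>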
<p class=MsoNormal>We would be very grateful for any suggestions you might have.<o:p></o:p></p>
<div>
<p class=MsoNormal>Best wishes<o:p></o:p></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>John
McKenny<br>
Deputy Head of the Division of English Studies<br>
University of Nottingham Ningbo, China</span><span style='font-size:10.0pt;
font-family:"Verdana","sans-serif"'><br>
</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>199
Taikang Dong Lu<br>
Ningbo, Zhejiang Province<br>
P.R. China 315100</span><span style='font-size:10.0pt;font-family:
"Verdana","sans-serif"'><br>
<br>
</span><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'><a
href="mailto:john.mckenny@nottingham.edu.cn">john.mckenny@nottingham.edu.cn</a><o:p></o:p></span></p>
</div>
</div>
</body>
</html>