<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
On 16/06/2010 18:40, John MCKENNY wrote:
<blockquote
cite="mid:076B831E3BD70140B69F1D47DF456F4A1754DF9E95@MBX.nottingham.edu.cn"
type="cite">
<div class="Section1">
<p class="MsoNormal">Does anyone have a solution to the problem we are
facing in a corpus linguistic research project? We have been given
permission by the publishers and editors to download all issues of a
journal from the last 30 years obtainable from our university e-library
in the form of PDFs amounting to about 3,000,000 words. Starting with a
small sample (250,000 words), we tried using various methods and
software including Wordsmith Tools 5 to convert the PDFs into text-only
files. The result so far has been text-only files with many words
clumped together e.g. ‘inthefinalanalysisitseems’. Breaking up these
clumps is a time-consuming business. For this reason, we haven’t started
compiling our larger corpus. We would only build the larger corpus if
there was some kind of automated or semi-automated way to generate
text-only files which contained all and only the alphanumeric sequences
bounded by spaces in the original PDFs, in other words, without clumps.</p>
<p class="MsoNormal">We would be very grateful for any suggestions
you might have.<br>
</p>
</div>
</blockquote>
<br>
<br>
I have used Multivalent in the past
(<a href="http://multivalent.sourceforge.net/">http://multivalent.sourceforge.net/</a>);
it gives surprisingly good results, especially for multi-column
documents. It deals pretty well with footnotes (which tend to be
concatenated to the previous paragraph, leading to weird sentence
segmentation). <br>
<br>
One issue I had with it was that it is too clever: some sequences of
letters, like "fi", are output as the single ligature character
'&#xFB01;' (U+FB01) rather than as 'f' and 'i'. You'll need a simple
script to deal with those issues.<br>
<br>
And it's free software.<br>
<br>
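For the ligature issue mentioned above, here is a rough sketch of the
kind of simple script I mean. It is not part of Multivalent: the function
name <code>expand_ligatures</code> and the mapping table are my own
assumptions for illustration, covering only the Latin ligature code
points U+FB00 to U+FB06, so check your own output for other oddities.<br>
<pre>
```python
# Hypothetical helper, not part of Multivalent: expand Latin ligature
# characters (Unicode block "Alphabetic Presentation Forms") into the
# plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, mapped to a plain "st"
    "\ufb06": "st",
}

# str.translate expects a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with each ligature character replaced by its letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "in the \ufb01nal analysis"
    print(expand_ligatures(sample))
```
</pre>
Python's <code>unicodedata.normalize("NFKC", text)</code> would also
expand these ligatures, but it rewrites many other characters as well
(superscripts, full-width forms and so on), which may not be wanted in a
corpus, hence the explicit table here.<br>
<br>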
Regards,<br>
<br>
-- <br>
Emmanuel<br>
</body>
</html>