<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html; charset=ISO-8859-1"

 http-equiv="Content-Type">

  <title></title>

</head>

<body text="#000000" bgcolor="#ffffff">

On 16/06/2010 18:40, John MCKENNY wrote:

<blockquote

 cite="mid:076B831E3BD70140B69F1D47DF456F4A1754DF9E95@MBX.nottingham.edu.cn"

 type="cite">




  <div class="Section1">

  <p class="MsoNormal">Does anyone have a solution to the problem we are
facing in a corpus linguistic research project? We have been given permission
by the publishers and editors to download all issues of a journal from
the last 30 years obtainable from our university e-library in the form
of PDFs amounting to about 3,000,000 words. Starting with a small sample
(250,000 words), we tried using various methods and software including
Wordsmith Tools 5 to convert the PDFs into text-only files. The result so far
has been text-only files with many words clumped together e.g.
‘inthefinalanalysisitseems’.
Breaking up these clumps is a time-consuming business. For this reason,
we haven’t started compiling our larger corpus. We would only build the larger
corpus if there was some kind of automated or semi-automated way to generate
text-only files which contained all and only the alphanumeric sequences bounded
by spaces in the original PDFs, in other words, without clumps.</p>

  <p class="MsoNormal">We would be very grateful for any suggestions

you might have.<br>

  </p>

  </div>

</blockquote>

<br>

<br>

I have used Multivalent in the past
(<a href="http://multivalent.sourceforge.net/">http://multivalent.sourceforge.net/</a>).
It gives surprisingly good results, especially for multi-column
documents. It deals pretty well with footnotes (which tend to be
concatenated to the previous paragraph, leading to weird sentence
segmentation). <br>
<br>
One issue I had with it was that it is too clever: some sequences of
letters, like "fi", are output as the single ligature character '&#xFB01;'
rather than as 'f' and 'i'. You'll need a simple script to deal with those issues.<br>
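<br>
Such a script could look roughly like the Python sketch below. It is only an illustration and not part of Multivalent: the function name expand_ligatures and the choice of ligatures to cover are my own assumptions, and it presumes the extracted text has already been read in as Unicode.<br>
<pre>
```python
# Minimal sketch: replace Unicode ligature characters with plain letters.
# The mapping covers the common Latin ligatures from the Unicode
# "Alphabetic Presentation Forms" block; extend it if your PDFs use others.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with each known ligature expanded to its letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "in the \ufb01nal analysis the o\ufb03ce \ufb02oor"
    print(expand_ligatures(sample))
```
</pre>
An explicit table is used here rather than a blanket Unicode normalisation, so that nothing else in the corpus (accented letters, superscripts, and so on) gets altered by accident.<br>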

<br>

And it's free software.<br>

<br>

Regards,<br>

<br>

-- <br>

Emmanuel<br>

</body>

</html>