[Corpora-List] Looking for a XML to TEXT convertor/editor

Martin Wynne martin.wynne at oucs.ox.ac.uk
Tue Nov 28 09:37:21 UTC 2006


I'd use sed too, although I don't think Oliver's command will catch 
cases where there is a line break between the < and the >, because sed 
works on one line at a time. So it typically won't catch long comments 
in the markup, for example. If you run the following first:

grep "<" yourxmltext | grep -v ">" | less

it should show any lines with just an opening "<", and alert you to the 
presence of any potential problems.
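
If that check does turn up such lines, one possible workaround is to let 
sed pull in the next line whenever an unclosed "<" is left over, and then 
retry the substitution. The following is only a sketch of that idea, not a 
tested replacement for a real XML parser or for the XSLT route: the 
function name strip_tags is made up for illustration, and it assumes that 
no ">" occurs inside a comment or attribute value and that every "<" in 
the text starts markup.

```shell
# strip_tags: hypothetical helper name, used here for illustration only.
# Removes <...> spans even when they run across line breaks.
strip_tags() {
  # :a          label marking the top of the loop
  # s/...//g    delete every complete <...> span in the pattern space
  # /</{ ... }  if an unclosed "<" remains and this is not the last
  #             line ($!), append the next line (N) and loop back (ba)
  sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e '$!{' -e 'N' -e 'ba' -e '}' -e '}' "$@"
}

# Usage, with the same file names as in the commands above:
#   strip_tags yourxmltext > yourplaintext
```

Note that a stray "<" with no closing ">" makes sed keep joining lines 
until it finds one, so it is still worth running the grep check first.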

Martin

Oliver Mason wrote:
> With sed it's even easier...
>
> cat yourxmltext | sed 's/<[^>]*>//g' > yourplaintext
>
> This removes everything in '<..>'; not as complete as Lou's earlier
> suggestion regarding XSLT, but I guess it wins the prize for the
> shortest solution...
>
> Oliver
>
> On 27/11/06, Daniel Zeman <zeman at ufal.mff.cuni.cz> wrote:
>> If you have Perl on your machine (default on Linux), the attached Perl
>> script could help you.
>
>


-- 
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne at oucs.ox.ac.uk


