[Corpora-List] Looking for a XML to TEXT convertor/editor

Notis Toufexis notis.toufexis at gmail.com
Tue Nov 28 09:52:45 UTC 2006


This one is for everyone who is not into sed, Perl, etc.

The XML plugin for jEdit (a Java-based text editor, www.jedit.org) has a
"Remove all tags" command.

It might win the prize for the fastest way to do it, too.

Notis

On 11/28/06, Martin Wynne <martin.wynne at oucs.ox.ac.uk> wrote:
>
> I'd use sed too, although I don't think Oliver's command will catch
> cases where there is a line break between the < and the >, so typically
> won't catch long comments in the markup, for example. If you run the
> following first:
>
> cat yourxmltext | grep "<" | grep -v ">"  | less
>
> it should show any lines with just an opening "<", and alert you to the
> presence of any potential problems.
>
> Martin
>
> Oliver Mason wrote:
> > With sed it's even easier...
> >
> > cat yourxmltext | sed 's/<[^>]*>//g' > yourplaintext
> >
> > This removes everything in '<..>'; not as complete as Lou's earlier
> > suggestion regarding XSLT, but I guess it wins the prize for the
> > shortest solution...
> >
> > Oliver
> >
> > On 27/11/06, Daniel Zeman <zeman at ufal.mff.cuni.cz> wrote:
> >> If you have Perl on your machine (default on Linux), the attached Perl
> >> script could help you.
> >
> >
>
>
> --
> Martin Wynne
> Head of the Oxford Text Archive and
> AHDS Literature, Languages and Linguistics
>
> Oxford University Computing Services
> 13 Banbury Road
> Oxford
> UK - OX2 6NN
> Tel: +44 1865 283299
> Fax: +44 1865 273275
> martin.wynne at oucs.ox.ac.uk
>
>
>
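
For those who would rather stay with sed, here is a minimal sketch of a
variant that also handles the line-break case Martin describes above. It
keeps appending the next input line while an unclosed "<" remains, so a tag
or comment split across lines is removed as one unit. The function name
strip_tags is made up for illustration. The sketch assumes well-formed
markup with no ">" inside comments or attribute values and no CDATA
sections, and it leaves entities such as &amp; undecoded.

```shell
#!/bin/sh
# strip_tags: hypothetical helper, not part of any tool mentioned in the thread.
# Reads the named files (or standard input) and writes the text without tags.
#
#   :a                  loop label
#   s/<[^>]*>//g        delete every complete <...> in the pattern space
#   /</!b               no unclosed "<" left: print and go to the next line
#   $b                  last line reached: print what we have and stop
#   N                   otherwise append the next input line
#   ba                  and try again from the top
strip_tags() {
  sed -e ':a' \
      -e 's/<[^>]*>//g' \
      -e '/</!b' \
      -e '$b' \
      -e 'N' \
      -e 'ba' "$@"
}
```

Usage would be along the lines of: strip_tags yourxmltext > yourplaintext
Note that a stray "<" that never gets closed makes the loop swallow the
rest of the file into one line, so Martin's grep check is still worth
running first.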


-- 
http://www.early-modern-greek.org
http://www.mml.cam.ac.uk/greek/grammarofmedievalgreek/