[Corpora-List] converting non-embedded tags into embedded ones

Oliver Mason O.Mason at bham.ac.uk
Sun Feb 24 16:16:09 UTC 2008


The easiest solution (which also allows for several lex elements on
the same line, provided each one is *always* contained on a single
line, ie no line breaks inside a lex element) is to use sed, the unix
stream editor, which doesn't even require any manual preparation:

cat your_text_file | sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1/g' > output_file

But then, using XML tools is probably a bit more user-friendly... and
safer, as they don't rely on the exact formatting.  But I feel more
comfortable with sed than with XSLT :)

Oliver



PS What this expression does is to replace each matched lex element
('s' for substitute) by the matched sub-expressions (the bits between
\(...\)) in reverse order, hence \2 ('time' in the example) followed by
an underscore and \1 ('NN').  The final 'g' means global, ie more than
once a line if applicable.  'sed' can be a little daunting, but it is
very powerful.
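
To make that concrete, here is a small worked example of the same
substitution. The sample line is my own assumption about the input
format (unquoted pos values, two lex elements on one line), and the
second command is a hypothetical variant for files where the attribute
is written as pos="NN". Without it the quotes would end up in the
output, since \1 captures everything between '=' and '>'.

```shell
# Assumed sample input: two lex elements on a single line, unquoted pos values.
input='<lex pos=NN>time</lex> <lex pos=VBZ>flies</lex>'

# Same expression as above: \1 captures the tag, \2 the word.
result=$(printf '%s\n' "$input" |
  sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1/g')
printf '%s\n' "$result"    # prints: time_NN flies_VBZ

# Hypothetical variant: tolerate an optional double quote on either side
# of the pos value (\{0,1\} is the POSIX way of saying "zero or one").
quoted=$(printf '%s\n' '<lex pos="NN">time</lex>' |
  sed 's/<lex pos="\{0,1\}\([^">]*\)"\{0,1\}>\([^<]*\)<\/lex>/\2_\1/g')
printf '%s\n' "$quoted"    # prints: time_NN
```

Since sed works line by line, a lex element that is split over two
lines is left untouched by either command.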

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


