[Corpora-List] converting non-embedded tags into embedded ones
Oliver Mason
O.Mason at bham.ac.uk
Sun Feb 24 16:16:09 UTC 2008
The easiest solution is to use sed, the Unix stream editor, which
doesn't even require any manual preparation. It also copes with
several lex elements on the same line, provided each element is
*always* on a single line (i.e. no line breaks inside a lex element):
cat your_text_file |
  sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1/g' > output_file
But then, using XML tools is probably a bit more user-friendly... and
safer, as it doesn't rely on the exact formatting. But I feel more
comfortable with sed than with XSLT :)
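
To illustrate the formatting caveat, here is a small sketch. The input
is a made-up example (an assumed, unquoted pos=... format) in which one
lex element is broken across two lines. sed processes its input one
line at a time, so neither line matches the pattern and the text
passes through unchanged:

```shell
# Hypothetical input: a single lex element split over two lines.
# The unquoted pos=NN attribute format is an assumption for illustration.
broken=$(printf '<lex pos=NN>time\n</lex>\n' |
  sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1/g')

# Print what sed produced; the tags are still there.
printf '%s\n' "$broken"
```

An XML-aware tool would not care where the line breaks fall, which is
the sense in which it is safer.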
Oliver
PS What this expression does is replace each matched lex element
('s' for substitute) with the matched sub-expressions (the bits
between \(...\)) in reverse order, joined by an underscore: first \2
('time' in the example), then \1 ('NN'). The final 'g' means global,
i.e. more than once per line if applicable. 'sed' can be a little
daunting, but it is very powerful.
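
A quick way to see the back-references at work is to feed sed a single
sample line. This is only a sketch: the sample text and the unquoted
pos=... format are assumptions, and tag_words is just a hypothetical
name for a wrapper around the substitution.

```shell
# Hypothetical sample line with two lex elements (assumed input format).
sample='<lex pos=NN>time</lex> <lex pos=VBZ>flies</lex>'

# Wrap the substitution in a function so it can be reused in a pipeline.
# \1 captures the pos value, \2 the word; the replacement swaps them.
tag_words() {
  sed 's/<lex pos=\([^>]*\)>\([^<]*\)<\/lex>/\2_\1/g'
}

result=$(printf '%s\n' "$sample" | tag_words)
printf '%s\n' "$result"   # prints: time_NN flies_VBZ
```

Because of the trailing 'g', both elements on the line are rewritten.
If the real files quote the attribute value, the quotes would be
captured as part of \1, so the pattern would need adjusting.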