[Corpora-List] converting non-embedded tags into embedded ones

Emiliano Guevara emiliano.guevara at unibo.it
Sun Feb 24 09:57:31 UTC 2008


Dear Warren,

No idea if and/or how you could do that in M$ Word or Windows... or  
even why you would try to do corpus linguistics with an application  
that is designed to write business letters...

But if you have access to a Linux/Unix box, a bit of regex and AWK
would solve your problem in seconds.

Assuming that the corpus is REALLY ALWAYS like this:
   <lex pos=NN>time</lex>
and that every <lex></lex> element is on a separate line:

1. open your favorite text editor (capable of doing general search  
and replace with regexes)

2. delete "<lex pos=" string,
    delete "</lex>" string,
    you're left with:
    "NN>time"

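3. open a shell and do
    awk 'BEGIN {FS=">"; OFS = "_";}{print $2, $1}' input.txt > output.txt
    (input.txt and output.txt are placeholder names: the file you
    saved after step 2, and the converted result)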

That's all.
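
For what it's worth, the three steps above can also be chained in the shell, with sed doing the two deletions instead of the text editor. This is only a minimal sketch: it assumes sed and awk are installed and that the corpus really has exactly one <lex> element per line, and convert_lex is just a made-up name for illustration.

```shell
# convert_lex is a hypothetical helper name, not part of any tool.
# It reads lines such as <lex pos=NN>time</lex> on standard input.
# Steps 1-2 (sed): delete "<lex pos=" and "</lex>", leaving "NN>time".
# Step 3 (awk): split on ">" and print word and tag joined by "_".
convert_lex() {
  sed -e 's/<lex pos=//' -e 's/<\/lex>//' |
    awk 'BEGIN {FS=">"; OFS="_"} {print $2, $1}'
}

# Usage on a single sample line:
printf '%s\n' '<lex pos=NN>time</lex>' | convert_lex   # prints time_NN
```

To run it on a whole file you would redirect the file into the function and the result into a new file. Lines that do not follow the assumed pattern will come out wrong, so it is worth checking the output by eye.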

If you really cannot work on anything other than Windows/M$
Word, I think you could try doing steps 1 and 2 in M$ Word manually:
just use Find/Replace (never try doing macros... bad thing!) and
then convert the corpus, now in the format "NN>time", to a huge table
with columns divided by ">".
After that, grab the second column of the table and move it in front
of the first one.
Then convert everything back into text format, using "_" as the
column separator.

good luck,

E.

On 24 Feb 2008, at 10:15, Warren Tang wrote:

> Could someone help me with this problem:
>
> I have texts with non-embedded tags:
>
> eg: <lex pos=NN>time</lex>
>
> but I would like to convert them to embedded tags (if this is the  
> right term):
>
> eg: time_NN
>
> I have tried using a macro in MS Word but can't seem to find a way  
> to get it to do it. I do not know how to program so your expertise  
> here would be most appreciated.
>
>
> Warren
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

****************************************
Emiliano R. Guevara
Facoltà di Lingue e Lett. Straniere
Dip. di Lingue e Lett. Straniere
Università di Bologna
Via Cartoleria 5 (40124) Bologna, Italia

Homepage: http://morbo.lingue.unibo.it/

E-mail:   emiliano.guevara at unibo.it
           emiguevara at gmail.com
****************************************

