[Corpora-List] converting non-embedded tags into embedded ones

Stefan Th. Gries stgries at gmail.com
Sun Feb 24 17:27:10 UTC 2008


Hi

> Subject: [Corpora-List] converting non-embedded tags into embedded ones
> To: Corpora_AT_uib.no
> Could someone help me with this problem:
> I have texts with non-embedded tags:
> eg: <lex pos=NN>time</lex>
> but I would like to convert them to embedded tags (if this is the right term):
> eg: time_NN
> I have tried using a macro in MS Word but can't seem to find a way to get it to do it. I do not know how to program so your expertise here would be most appreciated.

Once you have the corpora as plain text files rather than Word
documents, you could do it in R (<http://www.r-project.org/>):

# a ridiculously oversimplified corpus line called x,
# which hopefully still conveys the idea
x<-"<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"

# a regular expression that does what you want
gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", x, perl=TRUE)

# the result
"time_NN funny_JJ"
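
To apply the same substitution to a whole corpus file rather than a single string, something along these lines should work. This is only a sketch: the file names and the two sample lines are made up for illustration (temporary files are used so the snippet runs as is), and it assumes one or more complete <lex> elements per line.

```r
# hypothetical input and output files; replace these with your own paths
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")

# made-up sample corpus, written out only so that the example is self-contained
writeLines(c("<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>",
             "<lex pos=DT>the</lex> <lex pos=NN>corpus</lex>"), infile)

# read the corpus file as a character vector, one element per line
corpus <- readLines(infile)

# gsub is vectorized, so the regular expression is applied to every line
converted <- gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3",
                  corpus, perl=TRUE)

# write the converted lines to the output file
writeLines(converted, outfile)
```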

Of course, this may have to be adapted depending on what the rest of
your corpus looks like. Stuff like this will be explained in my
forthcoming textbook /Quantitative Corpus Linguistics with R: A
Practical Introduction/; the companion website is at
<http://groups.google.com/group/corpling-with-r/web/quantitative-corpus-linguistics-with-r>
and the newsgroup where more such questions could also be posted is at
<http://groups.google.com/group/corpling-with-r>.
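
As one example of such an adaptation: if the corpus also contains tags other than <lex>, the leading "<.*?pos=" can start matching at an earlier tag and swallow it. The sketch below assumes a hypothetical sentence tag <s n=1> (not taken from the original question) and anchors the expression on the literal tag name instead.

```r
# a made-up line in which the <lex> elements are wrapped in a sentence tag
y <- "<s n=1><lex pos=NN>time</lex> <lex pos=JJ>funny</lex></s>"

# the general expression starts at the first "<" it finds and therefore
# also consumes the opening <s n=1> tag
loose <- gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", y, perl=TRUE)
loose
# "time_NN funny_JJ</s>"

# naming the tag explicitly leaves all other markup untouched
strict <- gsub("<lex pos=([^>]*)>([^<]*)</lex>", "\\2_\\1", y, perl=TRUE)
strict
# "<s n=1>time_NN funny_JJ</s>"
```

Whether the surrounding tags should be kept or stripped depends on what the converted corpus will be used for.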

# And here's an explanation of the regular expression:
Match the character "<" literally «<»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "pos=" literally «pos=»
Match the regular expression below and capture its match into backreference number 1 «([^>]*)»
   Match any character that is NOT a ">" «[^>]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 2 «([^<]*?)»
   Match any character that is NOT a "<" «[^<]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters "</" literally «</»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character ">" literally «>»
Match the regular expression below and capture its match into backreference number 3 «([^<]*)»
   Match any character that is NOT a "<" «[^<]*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
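
To see what each backreference captures, the first match can be inspected with regexec and regmatches. This is just an illustrative sketch using the same example line as above; regexec only reports the first match in each string, and its perl argument requires a reasonably recent version of R.

```r
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
pattern <- "<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"

# element 1 is the complete match, elements 2 to 4 are backreferences 1 to 3
m <- regmatches(x, regexec(pattern, x, perl=TRUE))[[1]]
m
# "<lex pos=NN>time</lex> "  "NN"  "time"  " "
```

Backreference 3 holds the space after the first closing tag, which is why the replacement "\\2_\\1\\3" keeps the words separated.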

HTH,
STG
-- 
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


