[Corpora-List] Fwd: converting non-embedded tags into embedded ones

Stefan Th. Gries stgries at gmail.com
Mon Feb 25 15:09:10 UTC 2008


Hi

Once you have the corpora as text files rather than Word documents, you
could do it in R (<http://www.r-project.org/>):

 # a ridiculously oversimplified corpus line called x,
 # which hopefully still conveys the idea
 x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"

 # a regular expression that does what you want
 gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", x, perl=TRUE)

 # the result
 "time_NN funny_JJ"

Of course, this may have to be adapted depending on what the rest of
your corpus looks like. Stuff like this will be explained in my
forthcoming textbook /Quantitative Corpus Linguistics with R: A
Practical Introduction/; the companion website is at
<http://groups.google.com/group/corpling-with-r/web/quantitative-corpus-linguistics-with-r>
and the newsgroup where more such questions could also be posted is at
<http://groups.google.com/group/corpling-with-r>.
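
To apply the same gsub call to a whole file rather than a single line,
something along the following lines might work. This is only a sketch:
the file names and the second corpus line are made up for illustration
(temporary files stand in for a real corpus), and it assumes one or
more complete tags per line, as in the example above.

```r
# hypothetical stand-in for a corpus file; real file names and
# contents will differ
corpus.file <- tempfile(fileext = ".txt")
writeLines(c("<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>",
             "<lex pos=DT>the</lex> <lex pos=NN>corpus</lex>"),
           corpus.file)

# read the file into a character vector, one element per line
corpus.lines <- readLines(corpus.file)

# gsub is vectorized, so all lines are converted in one call
converted <- gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3",
                  corpus.lines, perl = TRUE)

# write the converted lines to a new (here: temporary) file
output.file <- tempfile(fileext = ".txt")
writeLines(converted, output.file)

converted
```

Because gsub works on every element of the vector, no explicit loop
over the lines is needed.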

# And here's an explanation of the regular expression:
Match the character "<" literally «<»
Match any single character that is not a line break character «.*?»
  Between zero and unlimited times, as few times as possible,
  expanding as needed (lazy) «*?»
Match the characters "pos=" literally «pos=»
Match the regular expression below and capture its match into
backreference number 1 «([^>]*)»
  Match any character that is NOT a ">" «[^>]*»
     Between zero and unlimited times, as many times as possible,
     giving back as needed (greedy) «*»
Match the character ">" literally «>»
Match the regular expression below and capture its match into
backreference number 2 «([^<]*?)»
  Match any character that is NOT a "<" «[^<]*?»
     Between zero and unlimited times, as few times as possible,
     expanding as needed (lazy) «*?»
Match the characters "</" literally «</»
Match any single character that is not a line break character «.*?»
  Between zero and unlimited times, as few times as possible,
  expanding as needed (lazy) «*?»
Match the character ">" literally «>»
Match the regular expression below and capture its match into
backreference number 3 «([^<]*)»
  Match any character that is NOT a "<" «[^<]*»
     Between zero and unlimited times, as many times as possible,
     giving back as needed (greedy) «*»
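
To see what each of the three backreferences captures, one could pull
them out of the first match with regexec and regmatches. This is a
small sketch that assumes a reasonably recent R version in which
regexec accepts perl=TRUE; the labels tag, word, and trailing are my
own names for illustration, not anything R provides.

```r
x <- "<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
pattern <- "<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"

# regexec finds the first match together with its capture groups;
# regmatches extracts the corresponding substrings
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# element 1 is the whole match, elements 2 to 4 are
# backreferences 1 to 3 (the labels are illustrative only)
names(parts) <- c("match", "tag", "word", "trailing")
parts
```

For the first tag, backreference 1 holds NN, backreference 2 holds
time, and backreference 3 holds the single space before the next tag,
which is why the replacement "\\2_\\1\\3" yields "time_NN ".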

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


