[Corpora-List] Fwd: converting non-embedded tags into embedded ones
Stefan Th. Gries
stgries at gmail.com
Mon Feb 25 15:09:10 UTC 2008
Hi
Once you have the corpora as text files rather than Word documents, you
could do it in R (<http://www.r-project.org/>):
# a ridiculously oversimplified corpus line called x,
# which hopefully still conveys the idea
x<-"<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
# a regular expression that does what you want
gsub("<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)", "\\2_\\1\\3", x, perl=TRUE)
# the result
"time_NN funny_JJ"
Of course, this may have to be adapted in the light of what the rest of
your corpus looks like. Stuff like this will be explained in my
forthcoming textbook /Quantitative Corpus Linguistics with R: A
Practical Introduction/; the companion website is at
<http://groups.google.com/group/corpling-with-r/web/quantitative-corpus-linguistics-with-r>
and the newsgroup where more such questions could also be posted is at
<http://groups.google.com/group/corpling-with-r>.
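As an illustration of the kind of adaptation meant here (a sketch under
assumptions, not something taken from the textbook): if some tags in your
corpus have no pos attribute, the lazy «.*?» after "<" can run across a
tag boundary and swallow the untagged element. Replacing it with «[^>]*?»
keeps each match inside a single tag. The vector corpus.lines, the <s>
tags and the names loose.pattern and strict.pattern are made up for this
example.

```r
# made-up corpus lines; the second one contains a tag without pos=
corpus.lines<-c("<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>",
                "<s><lex pos=NN>time</lex></s>")

# the original expression: ".*?" may cross from "<s>" into the next tag
loose.pattern<-"<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"
loose<-gsub(loose.pattern, "\\2_\\1\\3", corpus.lines, perl=TRUE)

# a stricter variant: "[^>]*?" cannot leave the tag it started in
strict.pattern<-"<[^>]*?pos=([^>]*)>([^<]*)</[^>]*>([^<]*)"
strict<-gsub(strict.pattern, "\\2_\\1\\3", corpus.lines, perl=TRUE)

print(loose)   # the opening <s> of the second line is lost
print(strict)  # the <s>...</s> wrapper is left untouched
```

Since gsub() is vectorized, the same call would work on a whole file read
in with readLines().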
# And here's an explanation of the regular expression applied to x above:
Match the character "<" literally «<»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible,
   expanding as needed (lazy) «*?»
Match the characters "pos=" literally «pos=»
Match the regular expression below and capture its match into
backreference number 1 «([^>]*)»
   Match any character that is NOT a ">" «[^>]*»
      Between zero and unlimited times, as many times as possible,
      giving back as needed (greedy) «*»
Match the character ">" literally «>»
Match the regular expression below and capture its match into
backreference number 2 «([^<]*?)»
   Match any character that is NOT a "<" «[^<]*?»
      Between zero and unlimited times, as few times as possible,
      expanding as needed (lazy) «*?»
Match the characters "</" literally «</»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible,
   expanding as needed (lazy) «*?»
Match the character ">" literally «>»
Match the regular expression below and capture its match into
backreference number 3 «([^<]*)»
   Match any character that is NOT a "<" «[^<]*»
      Between zero and unlimited times, as many times as possible,
      giving back as needed (greedy) «*»
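To check what the three backreferences actually capture, you can look at
the first match directly. A small sketch using regexec() and regmatches()
from base R (the perl argument of regexec() needs a reasonably recent R
version, which is an assumption here); the names tag.pattern, m and parts
are just for illustration:

```r
x<-"<lex pos=NN>time</lex> <lex pos=JJ>funny</lex>"
tag.pattern<-"<.*?pos=([^>]*)>([^<]*?)</.*?>([^<]*)"

# regexec() returns the position of the first match and of its
# parenthesized groups; regmatches() extracts the matching strings
m<-regexec(tag.pattern, x, perl=TRUE)
parts<-regmatches(x, m)[[1]]

# parts[1] is the whole match, parts[2:4] are backreferences 1 to 3
print(parts)
```

Here parts[4] is the single space after the closing tag, which is why the
replacement "\\2_\\1\\3" keeps the words separated.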
HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora