Corpora: line joining
Susana Sotelo Docio
sdocio at usc.es
Sat Feb 24 11:33:42 UTC 2001
Hello,
> I need to fix an output from a tagger and join consecutive lines of text, so
> that, for example, this:
> de PREP
> a ART
> turns into this:
> da CPR
> Does anyone know how to do this in sed or perl?
If the tagger output is a large file, you may prefer flex (under
Unix/Linux). The scanner would be:
------------------------------file contrac.lex------------------
%%
^de\tPREP\na\tART\n { printf("da\tCPR\n"); }
%%
------------------------------end-------------------------------
Compile the scanner and run it on the tagged text:
flex contrac.lex; gcc -o contrac lex.yy.c -lfl
./contrac < tagged_text.in > tagged_text.out
If you prefer Perl, the script could look something like this:
------------------------------file contrac.pl------------------
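#!/usr/bin/perl
use strict;
use warnings;

# Join "de PREP" + "a ART" on consecutive lines into "da CPR".
while (my $line = <>) {
    # Anchor with ^ so that forms such as "desde PREP" do not match.
    if ($line =~ /^de\tPREP\n/) {
        my $next = <>;
        if (defined $next && $next =~ /^a\tART\n/) {
            print "da\tCPR\n";
        }
        else {
            print $line;
            print $next if defined $next;    # undefined at end of input
        }
    }
    else {
        print $line;
    }
}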
--------------------------------end----------------------------
Usage:
perl contrac.pl tagged_text.in > tagged_text.out
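Since the question also mentioned sed, here is a rough sketch of the same join as a sliding two-line window. It assumes a POSIX shell and a tab between word form and tag; the function name join_contraction is only an illustrative choice, and I have not tried it on every sed implementation.

```shell
#!/bin/sh
# join_contraction: hypothetical helper name, used here for illustration only.
# Reads tagged text on stdin and writes the joined text to stdout.
join_contraction() {
    tab=$(printf '\t')    # literal tab; \t inside a sed pattern is not portable
    # $!N  append the next input line to the pattern space (except on the last line)
    # s    replace the exact pair "de PREP" / "a ART" with "da CPR"
    # P;D  print and drop the first line, keeping the second for the next cycle
    sed -e '$!N' \
        -e "s/^de${tab}PREP\\na${tab}ART\$/da${tab}CPR/" \
        -e 'P;D'
}

# Small demonstration on four tagged lines.
printf 'casa\tN\nde\tPREP\na\tART\nvila\tN\n' | join_contraction
```

Because the second line of a non-matching pair stays in the pattern space, it is tested again as the first line of the next pair, so no candidate "de" line is skipped.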
For files with DOS line endings, you must replace \n with \r\n in the
patterns. I assume a single tab between each word form and its tag.
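As an alternative to editing the patterns, the carriage returns could be stripped before processing and restored afterwards. This is only a sketch, assuming a POSIX shell with tr and sed; to_unix and to_dos are hypothetical helper names, not existing commands.

```shell
#!/bin/sh
# to_unix / to_dos: hypothetical helper names, used here for illustration only.

# Delete every carriage return, turning \r\n line endings into \n.
to_unix() {
    tr -d '\r'
}

# Append a carriage return at the end of each line, turning \n into \r\n.
to_dos() {
    cr=$(printf '\r')
    sed "s/\$/${cr}/"
}

# Possible pipeline around the flex scanner (assumes ./contrac has been built):
#   to_unix < tagged_text.in | ./contrac | to_dos > tagged_text.out
```

With this approach the \n-based patterns can stay unchanged for both kinds of file.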
Greetings,
Susana.
----------------------------------------------------------------------
Susana Sotelo Docío
Facultade de Filoloxía sdocio at usc.es _o)
Universidade de Santiago http://web.usc.es/~fesdocio / \\
"Neunu ti at a abberrer mai si thocceddas a sas jannas _(___V
cun mudos thoccos de ocros" #96506
----------------------------------------------------------------------