Corpora: line joining

Susana Sotelo Docio sdocio at usc.es
Sat Feb 24 11:33:42 UTC 2001


Hello,

> I need to fix an output from a tagger and join consecutive lines of text, so
> that, for example, this:
> de    PREP
> a   ART
> turns into this:
> da    CPR
> Does anyone know how to do this in sed or perl?

If the output of the tagger is a big file, you might prefer flex (under
Unix/Linux). The scanner would be:

------------------------------file contrac.lex------------------
%%
^de\tPREP\na\tART\n   { printf("da\tCPR\n"); }
%%
------------------------------end-------------------------------

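Text that does not match the rule is copied to the output unchanged (this
is flex's default rule). Compile the code and run the resulting program:

   flex contrac.lex; gcc -o contrac lex.yy.c -lfl

   ./contrac < tagged_text.in > tagged_text.out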

If you prefer perl, the script could be something like:

------------------------------file contrac.pl------------------
#!/usr/bin/perl

while(<>)
{
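  if(/^de\tPREP\n/)    # anchored: only the word form "de" itself matches
  {
    $newline = <>;
    if(defined $newline && $newline =~ /^a\tART\n/) { print "da\tCPR\n" }
    # no contraction: print this line, then re-examine the next one
    elsif(defined $newline) { print; $_ = $newline; redo }
    else { print }     # "de PREP" was the last line of the input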
  }
  else { print }
}
--------------------------------end----------------------------

Usage:
  perl contrac.pl tagged_text.in > tagged_text.out
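
Since the question also mentions sed, here is a rough sketch of the same
idea using the usual N/P/D two-line sliding window. It is only an
illustration: the function name join_contractions and the TAB variable are
made up for this example, and it assumes a single tab between word form and
tag.

```shell
#!/bin/sh
# Sketch only: join_contractions and TAB are hypothetical names.
# A literal tab is built with printf because "\t" inside a sed
# pattern is not portable across sed implementations.
TAB=$(printf '\t')

join_contractions() {
  # $!N : append the next input line to the pattern space (not on the last line)
  # s   : if the two lines are exactly "de<TAB>PREP" and "a<TAB>ART", join them
  # P;D : print and delete the first line, then go on with the remainder
  sed -e '$!N' \
      -e "s/^de${TAB}PREP\\na${TAB}ART\$/da${TAB}CPR/" \
      -e 'P' -e 'D'
}
```

After defining the function it could be run as:

   join_contractions < tagged_text.in > tagged_text.out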

Under DOS, you must replace \n with \r\n. I assume tabs between word forms
and tags.

Greetings,
Susana.

----------------------------------------------------------------------
Susana Sotelo Docío
Facultade de Filoloxía                         sdocio at usc.es   _o)
Universidade de Santiago         http://web.usc.es/~fesdocio   / \\
"Neunu ti at a abberrer mai si thocceddas a sas jannas       _(___V
cun mudos thoccos de ocros"                                  #96506
----------------------------------------------------------------------


