[Corpora-List] C-unit tagging

chris brew cbrew at acm.org
Thu Feb 21 18:57:56 UTC 2008


All the sentence segmentation tools that I am aware of (for example
David Palmer's SATZ) tag sentence boundaries by looking at a pretty wide
range of features of the text, some of which are really matters of how
newspapers happen to be laid out, and wouldn't immediately transfer to
use with a spoken corpus. So I think you are probably not going to find
an off-the-shelf tool.

In practice, the best next step is to find a friend who is good with
Python, Perl, Ruby or another good text processing tool that handles
regular expressions. Force your friend to sit down with you and take a
very detailed look at precisely what the corpus transcription you are
working with is like, then devise a regular expression that catches most
of the boundaries you want. The result will probably be highly tied to
the specifics of your corpus, and will probably not be perfect, but it
will be a start.
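
To make that concrete, here is a minimal Python sketch of the kind of
script this might lead to. Everything about the transcription format is
an assumption made up for illustration, not taken from any real corpus:
turns are assumed to start with a speaker label such as "A:", and pauses
are assumed to be marked "(.)" or with a timing like "(1.5)". Pauses are
only a rough proxy for C-unit boundaries, so the output is a list of
candidate units to check by hand, not a finished tagging.

```python
import re

# Hypothetical transcription conventions (adapt these to your own corpus):
#   "A: ..."  a speaker label opens each turn
#   "(.)"     a short untimed pause
#   "(1.5)"   a timed pause in seconds
PAUSE = re.compile(r"\s*(?:\(\.\)|\(\d+(?:\.\d+)?\))\s*")
TURN = re.compile(r"^(?P<speaker>[A-Z]\w*):\s*(?P<text>.*)$")


def candidate_units(turn_text):
    """Split one speaker turn into candidate units at pause markers."""
    return [part.strip() for part in PAUSE.split(turn_text) if part.strip()]


def segment(transcript):
    """Return (speaker, unit) pairs for every labelled turn in the transcript."""
    units = []
    for line in transcript.splitlines():
        match = TURN.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like a speaker turn
        for unit in candidate_units(match.group("text")):
            units.append((match.group("speaker"), unit))
    return units


if __name__ == "__main__":
    sample = "A: yeah (.) I went there yesterday (1.5) with my sister\nB: oh right"
    for speaker, unit in segment(sample):
        print(speaker, "|", unit)
```

A sensible way to use something like this is to run it over a small
stretch of the corpus, compare its output with boundaries marked by
hand, and adjust the patterns until most of the misses are gone.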

On 21/02/2008, Su Qi Apple <applesuqi at yahoo.co.uk> wrote:
>
> Dear All
>
> I am just beginning my study in corpus linguistics and in a corpus of
> spoken English in particular. I want to ask if someone can tell me if you
> know of any tagging programs that can indicate C-units as opposed to
> sentences.
>
> I look forward to your replies.
>
> Apple Su Qi
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>