13.1458, Software: CLaRK System-Corpora Development Tool

LINGUIST List linguist at linguistlist.org
Thu May 23 16:15:43 UTC 2002


LINGUIST List:  Vol-13-1458. Thu May 23 2002. ISSN: 1068-4875.

Subject: 13.1458, Software: CLaRK System-Corpora Development Tool

Moderators: Anthony Aristar, Wayne State U.<aristar at linguistlist.org>
            Helen Dry, Eastern Michigan U. <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org):
	Simin Karimi, U. of Arizona
	Terence Langendoen, U. of Arizona

Consulting Editor:
        Andrew Carnie, U. of Arizona <carnie at linguistlist.org>

Editors (linguist at linguistlist.org):
	Karen Milligan, WSU 		Naomi Ogasawara, EMU
	James Yuells, EMU		Marie Klopfenstein, WSU
	Michael Appleby, EMU		Heather Taylor, EMU
	Ljuba Veselinova, Stockholm U.	Richard John Harvey, EMU
	Dina Kapetangianni, EMU		Renee Galvis, WSU
	Karolina Owczarzak, EMU

Software: John Remmers, E. Michigan U. <remmers at emunix.emich.edu>
          Gayathri Sriram, E. Michigan U. <gayatri at linguistlist.org>

Home Page:  http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.



Editor for this issue: James Yuells <james at linguistlist.org>

=================================Directory=================================

1)
Date:  Mon, 20 May 2002 19:55:39 +0300
From:  "Kiril Simov" <kivs at bgcict.acad.bg>
Subject:  CLaRK System - an XML-based System for Corpora Development

-------------------------------- Message 1 -------------------------------

Dear List members,

I would like to announce the CLaRK System - an XML-based System
for Corpora Development. It is available on the web page of
the BulTreeBank Project:

http://www.bultreebank.org/

Please follow the "CLaRK System" link and then "Download".

The system is implemented in Java.

Short description:

CLaRK is an XML-based software system for corpora development.
The main aim behind its design is to minimise human intervention
during the creation of language resources. It incorporates
several technologies: (1) XML technology; (2) Unicode;
(3) cascaded regular grammars; (4) constraints over XML documents.

For document management, storage and querying, we chose XML
technology because it is popular and easy to understand.
The core of CLaRK is an XML editor, which is the main interface
to the system. Besides the XML language itself, we implemented
an XPath language for navigation in documents and an XSLT
language for the transformation of XML documents.
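
As a rough illustration of XPath navigation over a corpus document,
the sketch below uses the standard Java XML API (javax.xml.xpath),
not CLaRK's own XPath engine. The <s> and <w> elements and the "pos"
attribute are assumptions invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical corpus fragment: the <s>/<w> elements and the "pos"
    // attribute are invented for this example, not taken from CLaRK.
    static final String SAMPLE =
        "<s><w pos=\"D\">the</w><w pos=\"N\">corpus</w><w pos=\"V\">grows</w></s>";

    // Evaluate an XPath expression and return the text of each selected node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Select every word tagged as a noun.
        System.out.println(select(SAMPLE, "/s/w[@pos='N']")); // prints [corpus]
    }
}
```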

For multilingual processing tasks, CLaRK is based on a
Unicode encoding of the information inside the system.
There is a mechanism for the creation of a hierarchy of
tokenisers. Tokenisers can be attached to elements in the DTDs,
so that different parts of a document can be processed by
different tokenisers.

The basic mechanism of CLaRK for linguistic processing of
text corpora is the cascaded regular grammar processor.
The main challenge for the grammars in question is how to apply
them to an XML encoding of the linguistic information. The system
offers a solution that uses an XPath language to construct
the input word for the grammar and an XML encoding of the
categories of the recognised words.
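
The idea of a cascade can be sketched as follows. This is a much
simplified illustration, not CLaRK's grammar formalism: the input word
is assumed to be a flat string of part-of-speech tags (in CLaRK it would
be built from the XML document via XPath), and the tag set and the two
rules (NP, PP) are invented for the example. Each stage rewrites a
matched span into one new category, and later stages see the output of
earlier ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CascadeSketch {
    // Ordered stages: category name -> regular expression over tags.
    // Every tag in the input word is followed by one space, and \b keeps
    // a rule from matching in the middle of a longer tag.
    static final Map<String, String> STAGES = new LinkedHashMap<>();
    static {
        STAGES.put("NP", "\\b(D )?(A )*N "); // optional determiner, adjectives, noun
        STAGES.put("PP", "\\bP NP ");        // preposition plus an NP from stage 1
    }

    // Apply the stages in order to a space-separated tag sequence.
    static String apply(String tags) {
        String word = tags.trim() + " ";
        for (Map.Entry<String, String> stage : STAGES.entrySet()) {
            word = word.replaceAll(stage.getValue(), stage.getKey() + " ");
        }
        return word.trim();
    }

    public static void main(String[] args) {
        System.out.println(apply("V P D A N")); // prints V PP
    }
}
```

In a real system the recognised categories would be written back into
the document as XML markup rather than returned as a string.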

Several mechanisms for imposing constraints over XML
documents are available. These constraints cannot be stated with
standard XML technology. The following types of constraints
are implemented in CLaRK: (1) Regular expression constraints -
additional constraints over the content of given elements based
on a context; (2) Number restriction constraints - cardinality
constraints over the content of a document; (3) Value constraints -
restrictions on the possible content or parent of an element in
a document based on a context. The constraints are used in
two modes: (a) checking the validity of a document against a set
of constraints; (b) supporting the linguist in his/her work during
the building of a corpus. The first mode allows the creation of
constraints for the validation of a corpus according to given
requirements. The second mode supports the underlying strategy of
minimising human labour.
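
To make the checking mode concrete, here is a minimal sketch of a
number restriction constraint written against the standard Java DOM
API. It is not CLaRK's constraint language: the class and method names
and the <text>/<s>/<w> elements are assumptions made for the example.
The check counts how many parent elements have a number of direct
children of a given name outside the range [min, max].

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NumberRestrictionSketch {
    // Count the `parent` elements whose number of direct `child`
    // elements falls outside the inclusive range [min, max].
    static int countViolations(String xml, String parent, String child,
                               int min, int max) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList parents = doc.getElementsByTagName(parent);
            int violations = 0;
            for (int i = 0; i < parents.getLength(); i++) {
                NodeList kids = parents.item(i).getChildNodes();
                int count = 0;
                for (int j = 0; j < kids.getLength(); j++) {
                    Node kid = kids.item(j);
                    if (kid.getNodeType() == Node.ELEMENT_NODE
                            && kid.getNodeName().equals(child)) {
                        count++;
                    }
                }
                if (count < min || count > max) {
                    violations++;
                }
            }
            return violations;
        } catch (Exception e) {
            throw new IllegalStateException("Constraint check failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical rule: every sentence <s> must contain at least one word <w>.
        String xml = "<text><s><w>a</w><w>b</w></s><s></s></text>";
        System.out.println(countViolations(xml, "s", "w", 1, Integer.MAX_VALUE)); // prints 1
    }
}
```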


With best regards,

Kiril

---------------------------------------------------------------
Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, CLPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
E-mail: kivs at bgcict.acad.bg
Web: http://www.bultreebank.org/
---------------------------------------------------------------

---------------------------------------------------------------------------
LINGUIST List: Vol-13-1458