16.235, Software: Spanish corpus with integrated search functions

LINGUIST List linguist at linguistlist.org
Tue Jan 25 17:29:19 UTC 2005


LINGUIST List: Vol-16-235. Tue Jan 25 2005. ISSN: 1068 - 4875.

Subject: 16.235, Software: Spanish corpus with integrated search functions

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org)
        Sheila Collberg, U of Arizona
        Terry Langendoen, U of Arizona

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Neil Salmond <neil at linguistlist.org>
================================================================

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.


===========================Directory==============================

1)
Date: 22-Jan-2005
From: Craig Schulenberg < cschulen2004 at aol.com >
Subject: Spanish corpus with integrated search functions

	
-------------------------Message 1 ----------------------------------
Date: Tue, 25 Jan 2005 12:28:13
From: Craig Schulenberg < cschulen2004 at aol.com >
Subject: Spanish corpus with integrated search functions


As an outgrowth of our efforts to develop a Parser/Tagger for Spanish, we
have created a prototype program (Literature Assistant) that integrates a
corpus (already processed by our Parser) with a 'Reader' interface
and some powerful search functions. This program is entirely self-contained
and employs an extremely fast database of our own design.  We have no
intention of developing this program into a commercial product; rather, it
is a research tool that greatly helps us identify the
(many) weaknesses in our Parser and in our Dictionary.  We would
appreciate feedback on the design and features of this software approach,
and would be interested in collaborative efforts on Parser/Tagger
implementations and corpus search algorithms.  The Literature Assistant
runs in a DOS window on a PC.

The corpus includes 700 works (mostly novels), and menu screens allow
selecting an author, a work, and (finally) a chapter or bookmark.  The user
then sees a 'Reader' screen which shows the complete text, and allows rapid
page up/down, top-of-text, and end-of-text positioning.  When a word or
phrase is highlighted (by moving the cursor), its definition is shown
(drawn from our 48,000-word Dictionary).  Conjugated verb forms are
referenced back to their infinitives and their definitions (based on our
13,066-verb database).  If a highlighted word is selected, a second screen
immediately appears which shows 'all' sentences in the corpus that use the
same word/verb.  On this second screen any of the cross-referenced works
can then be 'jumped to' by selecting that particular sentence. In this case
the user is positioned in the Reader Screen for this newly selected work.
In this way all of the texts may be traversed by following these links
between the two screens.
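
The cross-referencing described above can be sketched roughly as follows.
This is a minimal illustration in Python, not the Literature Assistant's
actual implementation (its database design is not described in this
message). The toy corpus, the small form-to-infinitive table, and all
function names are hypothetical. The idea shown is an index from each
infinitive or word to the sentences that use it, so that selecting any
conjugated form retrieves every sentence sharing the same verb.

```python
from collections import defaultdict

# Hypothetical toy data: a tiny form-to-infinitive table standing in for the
# verb database, and three sentences standing in for the corpus.
LEMMAS = {
    "gusta": "gustar",
    "gustan": "gustar",
    "gustaba": "gustar",
    "canta": "cantar",
}

CORPUS = [
    ("Obra A", "Me gusta el mar."),
    ("Obra B", "No le gustan las novelas."),
    ("Obra B", "Ella canta bien."),
]


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?¡¿\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def lemma_of(word):
    """Map a conjugated form to its infinitive; other words map to themselves."""
    return LEMMAS.get(word, word)


def build_index(corpus):
    """Map each lemma to the ids of the sentences in which it occurs."""
    index = defaultdict(list)
    for sent_id, (_title, sentence) in enumerate(corpus):
        # A set avoids listing a sentence twice when a lemma repeats in it.
        for lemma in {lemma_of(w) for w in tokenize(sentence)}:
            index[lemma].append(sent_id)
    return index


def sentences_using(word, corpus, index):
    """Return the (title, sentence) pairs that share the lemma of `word`."""
    return [corpus[i] for i in index.get(lemma_of(word.lower()), [])]


INDEX = build_index(CORPUS)
for title, sentence in sentences_using("gustaba", CORPUS, INDEX):
    print(f"{title}: {sentence}")
```

Each result keeps its title, so a 'jump' from the Sentence Screen to the
Reader Screen would only need the work and the position of the sentence.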

The second screen (Sentence Screen) permits corpus searches.  For example,
the query 'gustar(se *)' will find all forms of the reflexive 'se' followed
by any conjugated form of 'gustar'.  All sentences that meet the search
criteria are shown, along with their title and author.  A special feature
(Jot-a-Note) is provided which makes it easy to generate a textual
commentary on any item observed on any screen.  This output file can then
be processed in any text editor.
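
One possible reading of the example query 'gustar(se *)' is sketched below
in Python. The query syntax is not specified in this message, so the
interpretation here is an assumption: a reflexive pronoun (me, te, se, nos,
os) immediately followed by any form of the named verb. The pronoun set,
the partial list of 'gustar' forms, the sample sentences, and the function
names are all hypothetical, and the matching is on surface forms only,
without the tagging a real Parser would supply.

```python
# Assumed reading of 'gustar(se *)': a reflexive pronoun directly followed
# by any conjugated form of the verb. Both sets below are illustrative.
REFLEXIVE_PRONOUNS = {"me", "te", "se", "nos", "os"}
GUSTAR_FORMS = {"gusto", "gustas", "gusta", "gustamos", "gustan", "gustaba"}

CORPUS = [
    ("Obra A", "Me gusta el mar."),
    ("Obra B", "Se gustan mucho."),
    ("Obra C", "Le gusta leer."),
    ("Obra C", "Se fue ayer."),
]


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?¡¿\"'()") for w in sentence.lower().split())
    return [w for w in words if w]


def matches(sentence, pronouns, verb_forms):
    """True if some pronoun is immediately followed by one of the verb forms."""
    tokens = tokenize(sentence)
    return any(a in pronouns and b in verb_forms
               for a, b in zip(tokens, tokens[1:]))


def search(corpus, pronouns, verb_forms):
    """Return every (title, sentence) pair that satisfies the query."""
    return [(title, sentence) for title, sentence in corpus
            if matches(sentence, pronouns, verb_forms)]


for title, sentence in search(CORPUS, REFLEXIVE_PRONOUNS, GUSTAR_FORMS):
    print(f"{title}: {sentence}")
```

Because the match is purely on word forms, 'Me gusta el mar.' is returned
even though 'me' is not reflexive there; distinguishing such cases is the
kind of decision that depends on the accuracy of the Parser/Tagger.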

It is immediately clear that our Parser/Tagger is only 90-95% accurate at
this point, and that our Dictionary is too small to do proper justice to
these kinds of texts.  Nonetheless, we believe that this is an interesting
approach not only to corpus linguistics, but also to making Spanish
literature more accessible and interactive.

Linguistic Field(s): Text/Corpus Linguistics





-----------------------------------------------------------
LINGUIST List: Vol-16-235	

	


