[Corpora-List] Survey: applications using grammar-based parsers
Trond Trosterud
trond.trosterud at uit.no
Tue Mar 31 14:23:45 UTC 2009
Summary of the Parser query
In my original query from March 27th, I asked for references to
grammar-based parsers. In my query, I excluded Constraint Grammar, as
I already knew that this framework has achieved the results I asked for:
* Robust parsing results for a large number of languages
(cf. http://en.wikipedia.org/wiki/Constraint_grammar for a list; the
full-fledged CG grammars on this list have accuracy results better
than the seemingly magical 97% ceiling of statistical parsers, and
several of them include dependency grammar)
* in use in a wide range of practical applications, commercial and
open source
(An arbitrary and non-exhaustive list: lingsoft.fi: Finnish, Swedish,
Norwegian, Danish grammar checker; gramtrans.com: MT between English
and different Scandinavian languages; visl.sdu.dk: grammar learning
platform; wiki.apertium.org: MT between a wide range of languages;
giellatekno.uit.no: Sami parsing and iCALL; connexor.com: Parsing, and
a wide range of text processing applications.)
----
Here is a summary of the answers to the query. The answers were
heterogeneous, ranging from a bare URL to short but precise
descriptions and references to articles and documentation. Rather than
posting the letters themselves, I give a more uniform summary here,
quoting either from the letters or from their links, as seemed fit.
People interested in the formalisms should of course skim
my short characterisations and go on to the URLs given.
I got responses relating to 5 different (groups of) parsers.
Commercial MT systems
---------------------
(Systran, ATLAS, Duet and so on) rely on hand-written rules.
Here is Francis Bond's evaluation:
"They don't generally publish parse accuracy results, although I
expect they approach 90-95% on labelled brackets. Of course, they
would be nowhere near this for sentence accuracy, but then no one is.
Many of these parsers are inspired by grammars, although they are not
generally based on a single grammatical theory."
GTA parser
------------
This is a robust shallow parser for Swedish, which identifies phrases
with an accuracy of 88%. It uses 260 hand-written rules, written in an
object-oriented notation resembling C++. GTA does not try to build full
trees from a core grammar; rather, it matches the input string against
analysis candidates, relying on longest matching.
GTA is used in Grim, a language learning environment for Swedish.
URLs GTA/Grim:
http://www.nada.kth.se/~knutsson/gta.pdf
http://www.nada.kth.se/~jsh/publications/Bigert04m0n.pdf
http://www.nada.kth.se/~knutsson/Karlstrom_Pargman_Knutsson08.pdf
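The longest-matching idea described for GTA can be illustrated with a
small sketch. This is not GTA's implementation or rule notation: the
rule table, the phrase labels, and the function name chunk are all
hypothetical, and the rules operate on part-of-speech tags only. The
point is merely that a matcher of this kind always returns some
analysis, since unmatched tokens are passed through rather than causing
a failure.

```python
# Hypothetical toy rules: a phrase label and the POS-tag sequence it covers.
# Illustrative only; these are not taken from GTA's actual rule set.
RULES = [
    ("NP", ("DET", "ADJ", "NOUN")),
    ("NP", ("DET", "NOUN")),
    ("NP", ("NOUN",)),
    ("PP", ("PREP", "DET", "NOUN")),
    ("VP", ("VERB",)),
]

def chunk(tags):
    """Greedy longest-match chunking over a list of POS tags."""
    chunks = []
    i = 0
    while i < len(tags):
        best = None  # (label, length) of the longest rule matching at i
        for label, pattern in RULES:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern and (best is None or n > best[1]):
                best = (label, n)
        if best is None:
            # No rule matches here: emit the token unanalysed and move on,
            # so the chunker never fails on the input as a whole.
            chunks.append(("O", tags[i:i + 1]))
            i += 1
        else:
            label, n = best
            chunks.append((label, tags[i:i + n]))
            i += n
    return chunks

print(chunk(["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"]))
```

At each position the longest applicable rule wins, so "DET ADJ NOUN"
is taken as one NP rather than being split into smaller phrases.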
www.langos.ro
-------------
This is a parser framework described on its page as "organic
parsing", and used in MT between Romanian and English. The parser is
used in a commercial setting (and evidently works well enough to be
economically viable). The site does not go into the parsing technique,
but the illustrative output linked from the homepage looks like
Constraint Grammar-based dependency analysis.
Link grammar
------------
Link grammar is based upon rules constraining the relations words may
have to neighbouring words (hence the name). It is implemented for
English, and is in use in an experimental en2de MT system. Link grammar
is the basis for the grammar checker in AbiWord, and it is used in
several commercial multi-player games. Link grammar also has an
add-on which can create dependency structures.
URLs:
http://www.link.cs.cmu.edu/link/
http://www.abisource.com/projects/link-grammar/
http://opencog.org/wiki/Relex, https://launchpad.net/relex
Weighted Constraint Dependency Grammar
--------------------------------------
The WCDG formalism describes natural language exclusively as
dependency structure, i.e. ordered, labelled pairs of words in the
input text. It performs natural language analysis under the paradigm
of constraint optimization, where the analysis that best conforms to
all rules of the grammar is returned. The WCDG formalism has been used
to make a comprehensive grammar of German (structural recall of 80-90%
for different text types).
URLs:
http://acl.ldc.upenn.edu/P/P04/P04-3008.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.8412&rep=rep1&type=pdf
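The constraint-optimization view described for WCDG can be sketched
roughly as follows. Everything in the sketch is an assumption made for
illustration: the five toy constraints, their penalty values, the
scoring by multiplying the penalties of violated constraints, and the
exhaustive search over head assignments (a real system would need a
smarter search strategy and a far larger grammar). What it shows is that
even an ill-formed input receives the analysis that violates the
grammar least, instead of no analysis at all.

```python
from itertools import product

# Hypothetical constraints: (description, penalty, test). A test receives the
# dependent index, the head index, and the tag list (index 0 is the root) and
# returns True when the constraint is satisfied. Penalties lie in [0, 1];
# a lower penalty means a more serious violation.
CONSTRAINTS = [
    ("determiner attaches to a noun", 0.1,
     lambda d, h, t: t[d] != "DET" or t[h] == "NOUN"),
    ("determiner precedes its head", 0.2,
     lambda d, h, t: t[d] != "DET" or h > d),
    ("noun attaches to a verb", 0.3,
     lambda d, h, t: t[d] != "NOUN" or t[h] == "VERB"),
    ("verb attaches to the root", 0.1,
     lambda d, h, t: t[d] != "VERB" or h == 0),
    ("only verbs attach to the root", 0.1,
     lambda d, h, t: h != 0 or t[d] == "VERB"),
]

def is_tree(heads):
    """True if every word reaches the root (index 0) without a cycle."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def score(heads, tags):
    """Multiply the penalties of all violated constraints (1.0 = none violated)."""
    total = 1.0
    for dep, head in enumerate(heads, start=1):
        for _name, penalty, test in CONSTRAINTS:
            if not test(dep, head, tags):
                total *= penalty
    return total

def best_analysis(pos_tags):
    """Search all head assignments; return (score, heads) of the best tree."""
    tags = ["ROOT"] + list(pos_tags)
    n = len(pos_tags)
    # Word d (1-based) may take any head except itself; 0 is the root.
    options = [[h for h in range(n + 1) if h != d] for d in range(1, n + 1)]
    candidates = [h for h in product(*options) if is_tree(h)]
    return max(((score(h, tags), h) for h in candidates), key=lambda x: x[0])

print(best_analysis(["DET", "NOUN", "VERB"]))   # well-formed word order
print(best_analysis(["NOUN", "DET", "VERB"]))   # ill-formed word order
```

The returned tuple lists, for each word in order, the position of its
head. The second call still yields a tree, only with a score below 1.0.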
Final comments
--------------
Two of the responses refer to commercial systems (langos and the
rule-based MT systems; one respondent also referred to the commercial
CG-based company Connexor). Whereas being commercial is in itself a
strong indication of good results (customers will not accept
malfunction), it also makes these systems hard to evaluate: for
commercial reasons, their source code, or in some cases even the
(methodology behind their) approach, is kept confidential. Nothing more
can thus be said about them here.
When I look at the other three parsers, GTA, WCDG, and Link grammar, I
find that they all bear some resemblance to the CG framework: the
parsing is based upon bottom-up local relations (looking at the
relations the words may have to each other), and they are thus always
able to come up with an analysis.
There is an alternative to this approach, namely building sentence
filters, much in the same way as finite-state transducers may be seen
as filters for wordforms (and hence function in spellers). FST-based
syntactic frameworks (Finite state intersection grammar), Prolog-based
grammars, and LFG and HPSG all rely upon full parses (linked to a
top node S, or the lexc variant LEXICON Root). (I know there is work
going on within e.g. LFG on parsing sentence fragments, but have not
seen any results from that work.)
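The filter view can be made concrete with a toy example. The sketch
below lets a regular expression over part-of-speech tags stand in for a
finite-state sentence grammar; the pattern, the tag names, and the
function name accepts are all hypothetical and far simpler than any real
grammar. Like a speller that either accepts a wordform or rejects it,
the filter says yes or no to the whole sentence, and gives nothing back
for a fragment or a deviant word order.

```python
import re

# Hypothetical finite-state "sentence filter" over POS tags: S -> NP VERB (NP)
# Each tag is followed by a single space so that tags cannot run together.
NP = r"(?:DET )?(?:ADJ )*NOUN "
SENTENCE = re.compile(rf"{NP}VERB (?:{NP})?")

def accepts(tags):
    """Return True only if the whole tag sequence is licensed as a sentence."""
    return SENTENCE.fullmatch("".join(t + " " for t in tags)) is not None

print(accepts(["DET", "NOUN", "VERB"]))   # complete sentence
print(accepts(["DET", "NOUN"]))           # fragment: no top node S
print(accepts(["NOUN", "DET", "VERB"]))   # deviant word order
```

Only the first call is accepted; for the other two inputs the filter
offers no partial analysis at all.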
These frameworks are missing from my survey, as I had suspected they
would be. What I had expected was to see some LFG and HPSG versions of
iCALL programs, as the language in pedagogical QA systems may be
restricted, thereby compensating for weaker results on unbounded text;
but then, these parsers would have been excluded by my first criterion.
In order to analyse unbounded text reliably, it thus seems that a
framework with the properties of the 4 approaches referred to here is
needed. That FST systems are successful for morphology but not for
syntax I see as a healthy reminder of the difference between these two
domains.
Thanks to all who answered my survey:
Francis Bond, Atanas Chanev, Vlad Gojol, Ola Knutsson, Linas Vepstas,
Yannick Versley
Greetings,
Trond Trosterud
---------------------
Here is my original survey text:
I am looking for references to grammar-based parsers / analysers
fulfilling the following criteria. They are:
- grammar-based (rule-based) parsers, except Constraint Grammar
parsers, which
- show robust results (say accuracy above 90-95%, rather than, say
around 60-70% for whatever they are supposed to do), and
- are used in full-scale, working applications (rather than toy/test
applications), within e.g.
- iCALL, MT, grammar checking, or the like.
----------------------------------------------------------------------
Trond Trosterud t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg f +47 7764 5216
Trond.Trosterud (a) uit.no http://www.hum.uit.no/a/trond/
----------------------------------------------------------------------