[Corpora-List] Survey: applications using grammar-based parsers

Tue Mar 31 14:23:45 UTC 2009

Summary of the Parser query

In my original query from March 27th, I asked for references to  
grammar-based parsers. In my query, I excluded Constraint Grammar, as  
I already knew that this framework has achieved the results I asked for:

* Robust parsing results for a large number of languages
(cf. http://en.wikipedia.org/wiki/Constraint_grammar for a list; the
full-fledged CG grammars on this list report accuracy results better
than the seemingly magical 97% ceiling of statistical parsers, and
several of them include dependency grammar)
* in use in a wide range of practical applications, commercial and  
open source
(An arbitrary and non-exhaustive list: lingsoft.fi: Finnish, Swedish,  
Norwegian, Danish grammar checker; gramtrans.com: MT between English  
and different Scandinavian languages; visl.sdu.dk: grammar learning  
platform; wiki.apertium.org: MT between a wide range of languages;  
giellatekno.uit.no: Sami parsing and iCALL; connexor.com: Parsing, and  
a wide range of text processing applications.)

----

Here comes a summary of the answers to the query. The answers were
heterogeneous, ranging from a mere URL to short but precise
descriptions and references to articles and documentation. Rather than
posting the letters themselves, I here give a more uniform summary,
quoting either from the letters or from their links, as seen fit.
People interested in the formalisms should of course skim my short
characterizations and go on to the URLs given.

I got responses relating to 5 different (groups of) parsers.

Commercial MT systems
---------------------
Commercial MT systems (Systran, ATLAS, Duet and so on) rely on
hand-written rules.
Here is Francis Bond's evaluation:
"They don't generally publish parse accuracy results, although I
expect they approach 90-95% on labelled brackets. Of course, they
would be nowhere near this for sentence accuracy, but then no one is.
Many of these parsers are inspired by grammars, although they are not
generally based on a single grammatical theory."
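
The gap between labelled-bracket accuracy and sentence accuracy can be
made concrete with a little arithmetic. The sketch below is only an
illustration, not a reported result: it assumes that bracket errors are
independent of each other and that a sentence has 10 or 20 brackets.
Both assumptions are mine, not Francis Bond's.

```python
def sentence_accuracy(bracket_accuracy: float, brackets_per_sentence: int) -> float:
    """Chance that every bracket in a sentence is correct.

    Assumes bracket errors are independent, which real parsers do not
    guarantee; the figure is a rough estimate only.
    """
    return bracket_accuracy ** brackets_per_sentence


if __name__ == "__main__":
    for p in (0.90, 0.95):          # the bracket accuracies quoted above
        for n in (10, 20):          # assumed number of brackets per sentence
            acc = sentence_accuracy(p, n)
            print(f"bracket accuracy {p:.2f}, {n} brackets: "
                  f"sentence accuracy about {acc:.2f}")
```

Under these assumptions even 95% bracket accuracy leaves well under
half of the 20-bracket sentences fully correct.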

GTA parser
------------
GTA is a robust shallow parser for Swedish, which identifies phrases
with an accuracy of 88%. It uses 260 hand-written rules, written in an
object-oriented notation resembling C++. GTA does not try to build
full trees from a core grammar; rather, it matches the input string to
analysis candidates, relying on longest matching.
GTA is used in Grim, a language learning environment for Swedish.
URLs for GTA/Grim:
http://www.nada.kth.se/~knutsson/gta.pdf
http://www.nada.kth.se/~jsh/publications/Bigert04m0n.pdf
http://www.nada.kth.se/~knutsson/Karlstrom_Pargman_Knutsson08.pdf
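
The longest-matching strategy described for GTA can be sketched
roughly as follows. This is a minimal illustration of greedy
longest-match chunking over part-of-speech tags, not GTA's actual rule
notation or rule set; the tag names, the PATTERNS table and the
chunk() function are all hypothetical.

```python
# Hypothetical phrase patterns over part-of-speech tags (not GTA's rules).
PATTERNS = {
    "NP": [("DET", "ADJ", "NOUN"), ("DET", "NOUN"), ("ADJ", "NOUN"),
           ("NOUN",), ("PRON",)],
    "PP": [("PREP", "DET", "NOUN"), ("PREP", "NOUN")],
    "VP": [("AUX", "VERB"), ("VERB",)],
}


def chunk(tags):
    """Scan left to right; at each position keep the longest matching pattern."""
    chunks = []
    i = 0
    while i < len(tags):
        best_label, best_len = None, 0
        for label, patterns in PATTERNS.items():
            for pattern in patterns:
                n = len(pattern)
                if n > best_len and tuple(tags[i:i + n]) == pattern:
                    best_label, best_len = label, n
        if best_label is None:
            # No rule applies: pass the token through unchunked, so the
            # parser always returns some analysis.
            chunks.append(("O", tags[i:i + 1]))
            i += 1
        else:
            chunks.append((best_label, tags[i:i + best_len]))
            i += best_len
    return chunks


if __name__ == "__main__":
    tags = ["DET", "ADJ", "NOUN", "AUX", "VERB", "PREP", "DET", "NOUN", "PUNCT"]
    for label, span in chunk(tags):
        print(label, span)
```

The fallback branch is what makes such a parser robust: input that
matches no rule is left unanalysed instead of making the whole
sentence fail.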

www.langos.ro
-------------
This is a parser framework referred to on its page as "organic
parsing", and used in MT between Romanian and English. The parser is
used in a commercial setting (and evidently works well enough to be
economically viable). The site does not go into the parsing technique,
but the illustrative output linked from the homepage looks like
Constraint Grammar-based dependency analysis.

Link grammar
------------
Link grammar is based upon rules constraining the relations words may
have to neighbouring words (hence the name). It is implemented for
English, and in use in an experimental en2de MT system. Link grammar is
the basis for the grammar checker in AbiWord, and it is used in
several commercial multi-player games. Link grammar also has an
add-on, which can create dependency structures.
URLs:
http://www.link.cs.cmu.edu/link/
http://www.abisource.com/projects/link-grammar/
http://opencog.org/wiki/Relex, https://launchpad.net/relex

Weighted Constraint Dependency Grammar
--------------------------------------
The WCDG formalism describes natural language exclusively as
dependency structure, i.e. ordered, labelled pairs of words in the
input text. It performs natural language analysis under the paradigm
of constraint optimization, where the analysis that best conforms to
all rules of the grammar is returned. The WCDG formalism has been used
to make a comprehensive grammar of German (structural recall of 80-90%
for different text types).
URLs:
http://acl.ldc.upenn.edu/P/P04/P04-3008.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54.8412&rep=rep1&type=pdf
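
The idea of parsing as constraint optimization can be sketched as
below. This is a toy illustration only, not the WCDG implementation:
the sentence, the four constraints and their penalties are invented,
the candidates are enumerated exhaustively (a real system would need a
far more efficient search), and I assume that the penalties of
violated constraints are multiplied, with a penalty of 0.0 acting as a
hard constraint.

```python
from itertools import product

# Toy input as (word, part of speech); index 0 is an artificial root node.
WORDS = [("ROOT", "ROOT"), ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]


def pos(i):
    return WORDS[i][1]


# Hypothetical constraints: (penalty, test). The test returns True when the
# edge dependent -> head satisfies the constraint. Lower penalty = worse.
CONSTRAINTS = [
    (0.1, lambda dep, head: pos(dep) != "DET" or pos(head) == "NOUN"),
    (0.3, lambda dep, head: pos(dep) != "DET" or dep < head),
    (0.5, lambda dep, head: pos(dep) != "NOUN" or pos(head) == "VERB"),
    (0.2, lambda dep, head: pos(dep) != "VERB" or head == 0),
]


def is_tree(heads):
    """heads[i] is the head of word i + 1; every word must reach the root."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:        # cycle (including a word heading itself)
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def score(heads):
    """Multiply the penalties of all violated constraints (1.0 = no violation)."""
    total = 1.0
    for dep, head in enumerate(heads, start=1):
        for penalty, satisfied in CONSTRAINTS:
            if not satisfied(dep, head):
                total *= penalty
    return total


def best_analysis():
    """Return the head assignment that best conforms to all constraints."""
    n = len(WORDS) - 1
    candidates = [h for h in product(range(n + 1), repeat=n) if is_tree(h)]
    return max(candidates, key=score)


if __name__ == "__main__":
    heads = best_analysis()
    for dep, head in enumerate(heads, start=1):
        print(f"{WORDS[dep][0]} -> {WORDS[head][0]}")
```

Because the constraints here are soft, some analysis is always
returned, even for input that violates several of them.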

Final comments
--------------
Two of the responses refer to commercial systems (langos and the
rule-based MT systems; one respondent also referred to the commercial
CG-based company Connexor). Whereas being commercial is in itself a
strong indication of good results (customers will not accept
malfunction), it also makes it hard to evaluate them: for commercial
reasons, their source code, or even in some cases the (methodology
behind their) approach, is kept confidential. Nothing more can thus be
said about them here.

When I look at the other three parsers, GTA, WCDG, and Link grammar, I
find that they all bear some resemblance to the CG framework: the
parsing is based upon bottom-up local relations (looking at the
relations the words may have to each other), and they are thus always
able to come up with an analysis.

There is an alternative to this approach, the one building sentence
filters, much in the same way as finite-state transducers may be seen
as filters for wordforms (and hence function in spellers). FST-based
syntactic frameworks (Finite State Intersection Grammar), Prolog-based
grammars, and LFG and HPSG all rely upon full parses (linked to a
top node S, or the lexc variant LEXICON Root). (I know there is work
going on within e.g. LFG on parsing sentence fragments, but have not
seen any results from that work.)

These frameworks are missing from my survey, as I had suspected they
would be. What I had expected was to see some LFG and HPSG versions of
iCALL programs, as the language in pedagogical QA systems may be
restricted, thereby compensating for weaker results on unbounded text,
but then, these parsers would have been excluded by my first
criterion. In order to analyse unbounded text reliably it thus seems
that a framework with the properties of the 4 approaches referred to
here is needed. That FST systems are successful for morphology but not
for syntax I see as a healthy reminder of the difference between these
two domains.

Thanks to all who answered my survey:

Francis Bond, Atanas Chanev, Vlad Gojol, Ola Knutsson, Linas Vepstas,  
Yannick Versley

Greetings,
Trond Trosterud

---------------------

Here is my original survey text:

I am looking for references to grammar-based parsers / analysers
fulfilling the following criteria. They are:
- grammar-based (rule-based) parsers, except Constraint Grammar
parsers, which
- show robust results (say accuracy above 90-95%, rather than, say
around 60-70% for whatever they are supposed to do), and
- are used in full-scale, working applications (rather than toy/test
applications), within e.g.
- iCALL, MT, grammar checking, or the like.

----------------------------------------------------------------------
Trond Trosterud                                        t +47 7764 4763
Institutt for språkvitskap, Det humanistiske fakultet  m +47 950 70140
N-9037 Universitetet i Tromsø, Noreg                   f +47 7764 5216
Trond.Trosterud (a) uit.no              http://www.hum.uit.no/a/trond/
dn------------------------------------------------------------------đŋ

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora