Corpora: On-line KWIC system in PHP

Mark Davies mdavies at ilstu.edu
Fri Jan 5 14:42:21 UTC 2001


[Sorry for any duplicate messages.  My email client is behaving strangely
today.]

>I am interested in processing corpora (mainly in Spanish and Japanese) and
>now I am preparing some exercises for my students for the new course
>beginning next April. What I am trying to do is a Web KWIC system using only
>(or mainly) PHP.
>
>Is there anybody using PHP for this purpose? For big corpora I am developing
>a system with PHP and MySQL, and I think that its response time is quite
>fast compared with PERL even without a backend database.

A while back I posted a notice about a web-accessible corpus of Spanish
texts that works on more or less the same basis as what you've
proposed.  The corpus is composed of 3,000,000 words in nearly 200 texts
from the 1200s to the 1900s (including 1,000,000 words from Modern Spanish,
divided equally among LatAm-Spoken, LatAm-Written, Spain-Spoken,
Spain-Written).  The URL is:

         http://mdavies.for.ilstu.edu/corpus

The data is stored in a SQL Server database and is indexed via the "Full
Text" indexing in SQL Server, which allows for proximity searches and
searches for several types of word forms.  The database is linked to the
web via Active Server Pages, using ADO (ActiveX Data Objects) and VBScript.
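
For readers who have not built one of these, here is a minimal sketch in
Python of the two operations discussed in this thread: a KWIC lookup and a
proximity search over a positional word index.  It is only an illustration
of the general idea, not the SQL Server "Full Text" implementation described
above; the function names (tokenize, build_index, kwic, near) and the sample
sentence are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(tokens):
    """Map each word to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def kwic(tokens, index, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for pos in index.get(keyword.lower(), []):
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append((left, tokens[pos], right))
    return lines


def near(index, word_a, word_b, max_distance=5):
    """Return position pairs where the two words occur within max_distance tokens."""
    return [
        (a, b)
        for a in index.get(word_a.lower(), [])
        for b in index.get(word_b.lower(), [])
        if a != b and abs(a - b) <= max_distance
    ]


# Hypothetical one-sentence "corpus", used only for illustration.
sample = "En un lugar de la Mancha de cuyo nombre no quiero acordarme"
tokens = tokenize(sample)
index = build_index(tokens)

for left, key, right in kwic(tokens, index, "de", width=2):
    print(f"{left:>20}  [{key}]  {right}")
```

A real system would keep the index in a database table rather than in
memory, but the lookup logic stays the same: find the positions of the
keyword, then pull the neighbouring words for display.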

The speed is quite good -- about 1-2 seconds for most searches -- which
compares nicely with the solution that you've suggested.  In addition, the
SQL Server approach is quite scalable.  Searches on a 200-million-word
corpus of Modern Portuguese (http://mdavies.for.ilstu.edu/corpus/publico)
are nearly as fast -- 5-10 seconds or less for nearly all searches.

What would be interesting is to try the PHP/MySQL approach with a large
database -- 50 million words or more -- and see what the performance is
like.  If it's still as fast as what you have right now, then I think
that it would be an ideal solution for the NT platform.  And of course one
of the main advantages of the PHP/MySQL solution is the cost (or lack of
cost :), as compared to the NT Server / SQL Server approach, which can be a
bit pricey.

Mark Davies
Illinois State University


