Corpora: On-line KWIC system in PHP

Sat Jan 6 08:57:18 UTC 2001

Mark Davies said:

> A while back I posted a notice about a web-accessible corpus of Spanish
> texts that works on more or less the same basis as what you've...

I visited your page and it is also fast from this part of the world (Japan).

> The speed is quite good -- about 1-2 seconds for most searches -- which
> compares nicely with the solution that you've suggested.  In addition, the
> SQL Server approach is quite scalable.  Searches on a 200 million word
> corpus of Modern Portuguese (http://mdavies.for.ilstu.edu/corpus/publico)
> are nearly as fast -- less than 5-10 seconds for nearly all searches.

Well, for small corpora, say a few million words, processing flat ASCII
files is not so slow. After your mail, I tested (using the function
microtime()) the response time (from the keyword input until the last
concordance line is printed) for Don Quijote (about 2 MB), and the results
were about 0.4 to 0.6 sec. (in "plain" PHP4 without an optimizer) or 4
seconds (in the older PHP3; unfortunately that is the test page I posted).
Getting more than 1200 matches of the Spanish article "el" in a smaller file
(La Gaviota, 0.5 MB) took about 1.3 sec. (PHP3), and about 0.8 sec. for less
frequent keywords. But with PHP4 I had a response time of less than 0.07
sec. in almost every case. In the near future I will (try to) install the
Zend Optimizer, Cache and Loader so that response times will be faster even
under heavy traffic. I will report the results to this list if there is any
interest.
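
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a flat-file KWIC search with timing. It is written in Python rather
than PHP, and it is not the code behind my test page: the function names
kwic() and timed_kwic(), the context width, and the whole-word matching rule
are assumptions made for illustration. time.perf_counter() plays the role
that microtime() plays in PHP.

```python
import re
import time


def kwic(text, keyword, width=30):
    """Return one concordance line per whole-word match of keyword."""
    # Collapse line breaks so each concordance line stays on one row.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


def timed_kwic(text, keyword, width=30):
    """Run kwic() and also return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    lines = kwic(text, keyword, width)
    elapsed = time.perf_counter() - start
    return lines, elapsed


if __name__ == "__main__":
    sample = (
        "En un lugar de la Mancha, de cuyo nombre no quiero acordarme, "
        "vivía el hidalgo. El caballero salió del lugar."
    )
    lines, elapsed = timed_kwic(sample, "el")
    for line in lines:
        print(line)
    print(f"{len(lines)} matches in {elapsed:.6f} sec.")
```

With a real corpus you would read the whole file into the text argument
first. Note that this measures only the search itself, not the time needed
to send the page to the browser.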

> What would be interesting is to use the PHP/mySQL approach with a large
> database -- 50 million words or more -- and see what the performance is
> like.  If it's still fast -- like what you have right now -- then I think
> that it would be an ideal solution for the NT platform.  And of course one
> of the main advantages of the PHP/mySQL solution is the cost (or lack of
> cost :), as compared to the NT Server / SQL Server approach, which can be a
> bit pricey.

Yes, it would be very interesting. I don't have such a big corpus, but in
the near future I would like to test it; if not with a Spanish corpus, I
could manage with any other available corpus, perhaps a Japanese one. Of
course, when PHP is used with MySQL, the code must be different in order to
get better performance.
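
To give an idea of how the code changes with a database, here is a rough
sketch of one possible design: store one row per token together with its
position, index the word column, and rebuild the context from neighbouring
positions. Everything in it is an assumption made for illustration. It uses
Python with SQLite as a stand-in for PHP with MySQL, and the table name
"tokens", the functions build_index() and concordance(), and the simple
tokenizer are all hypothetical. I have not tested this design on a large
corpus.

```python
import re
import sqlite3


def build_index(text):
    """Load the text into an in-memory table with one row per token."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    words = re.findall(r"\w+", text.lower())
    conn.executemany(
        "INSERT INTO tokens (pos, word) VALUES (?, ?)", enumerate(words)
    )
    # The index on word lets a keyword lookup avoid scanning every row.
    conn.execute("CREATE INDEX idx_word ON tokens (word)")
    conn.commit()
    return conn


def concordance(conn, keyword, span=3):
    """Return each hit with up to `span` tokens of context on each side."""
    hits = conn.execute(
        "SELECT pos FROM tokens WHERE word = ? ORDER BY pos",
        (keyword.lower(),),
    ).fetchall()
    results = []
    for (pos,) in hits:
        rows = conn.execute(
            "SELECT word FROM tokens WHERE pos BETWEEN ? AND ? ORDER BY pos",
            (pos - span, pos + span),
        )
        results.append(" ".join(word for (word,) in rows))
    return results


if __name__ == "__main__":
    conn = build_index(
        "En un lugar de la Mancha vivía el hidalgo y el caballero"
    )
    for line in concordance(conn, "el", span=2):
        print(line)
```

The point of the sketch is only that the search becomes an indexed lookup
instead of a scan of the whole file, which is why the flat-file code cannot
simply be reused.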

As you pointed out, one of the main advantages of this type of approach is
the lack of cost. Any student can install the whole system on a $500 PC
(hardware only) and it works flawlessly. It also works on Windows 98 and
other platforms.

Another interesting point is that proprietary software can be like a black
box, but with open source software you know what is inside and can modify
it (only sometimes ;-).  Perhaps we will have to try different approaches
for different purposes.

Antonio Ruiz Tinoco
Sophia University, Tokyo
a-ruiz at hoffman.cc.sophia.ac.jp