Arabic-L:LING:Arabic Corpus Search Tool
Dilworth Parkinson
dilworth_parkinson at byu.edu
Tue Jun 21 17:51:46 UTC 2005
------------------------------------------------------------------------
Arabic-L: Tue 21 Jun 2005
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic Corpus Search Tool
-------------------------Messages-----------------------------------
1)
Date: 21 Jun 2005
From:Dilworth Parkinson
Subject:Arabic Corpus Search Tool
I would like to announce the availability of an Arabic Corpus Search
Tool on the web. This tool is part of a development project that
eventually will allow searches of the tagged Arabic Treebank corpus.
However, while that is being developed, I am making available this
way of searching selected untagged corpora. This is provided now as
a kind of beta for people to try out and make comments on. It is
intended for use by students as part of advanced Arabic classes,
giving them access to many examples of particular structures and
words of interest. It could also be of use to researchers with
lexicographical and other similar interests.
Caveat: I find this tool to be extremely useful, but ALL results
should be taken with a 'grain of salt'. Arabic morphology is complex
and the addition of normal prefixes and suffixes to any particular
word will often create forms that are identical to other words. All
results should be checked for accuracy.
The web site 'works', but is not yet actually 'designed'. We are
working on that now, so forgive the 'klunkiness.'
The site allows you to look for a word or string (or regular
expression), typed in either Arabic or transliteration, and it lists
all the examples found in the corpus being searched, with the
reference, the 10 words before and the 10 words after. It sorts the
results either by the word before or the word after, and if desired
can present the word immediately before or immediately after in a
table sorted by frequency, so you can know the most important
collocations immediately.
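
The kind of listing described above (a hit with 10 words of context on each side, sorted by the neighboring word, plus a frequency table of neighbors) can be sketched roughly as follows. This is only an illustration of the idea, not the site's actual code; the function names and the sample sentence are invented, and the sample is not drawn from any of the corpora.

```python
import re
from collections import Counter

def concordance(tokens, pattern, window=10):
    """Return (before, hit, after) triples for every token matching pattern."""
    rx = re.compile(pattern)
    hits = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            before = tokens[max(0, i - window):i]   # up to `window` words before
            after = tokens[i + 1:i + 1 + window]    # up to `window` words after
            hits.append((before, tok, after))
    return hits

def sort_by_word_after(hits):
    """Order hits alphabetically by the word immediately after the hit."""
    return sorted(hits, key=lambda h: h[2][0] if h[2] else "")

def sort_by_word_before(hits):
    """Order hits alphabetically by the word immediately before the hit."""
    return sorted(hits, key=lambda h: h[0][-1] if h[0] else "")

def collocates_after(hits):
    """Frequency table of the word immediately after the hit."""
    return Counter(h[2][0] for h in hits if h[2]).most_common()

# Invented transliterated sample, for illustration only.
sample = "fy ywm AlqyAmQ w fy ywm Aldyn w fy ywm AlqyAmQ".split()
hits = concordance(sample, r"ywm")
for word, count in collocates_after(hits):
    print(word, count)
```

Sorting and counting on the immediate neighbor is what makes the frequent collocations stand out at a glance.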
You identify your search word with a particular part of speech label,
and that determines the prefixes and suffixes the search includes.
If you want to 'do your own' pattern with regular expressions, then
you should choose 'adv' which allows only wa- and fa-, or 'string'
which doesn't allow anything.
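
One way to picture how a part of speech label could control the allowed prefixes and suffixes is a lookup table that wraps the search word in optional affix groups. The sketch below is an assumption about the general approach, not the site's implementation. Only the 'adv' row (wa- and fa- only) and the 'string' row (nothing) follow the description above; the 'noun' row and the names AFFIXES and build_pattern are hypothetical.

```python
import re

# HYPOTHETICAL affix table. Only 'adv' and 'string' follow the post;
# the 'noun' row is an illustrative guess, not the site's real list.
AFFIXES = {
    "string": {"prefixes": [], "suffixes": []},
    "adv": {"prefixes": ["w", "f"], "suffixes": []},
    "noun": {"prefixes": ["w", "f", "b", "l", "Al"],
             "suffixes": ["h", "hA", "hm"]},
}

def _optional(alternatives):
    """Build an optional non-capturing group from a list of literal affixes."""
    if not alternatives:
        return ""
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")?"

def build_pattern(word, pos):
    """Compile a regex for `word` (itself a regex) with the affixes `pos` allows."""
    table = AFFIXES[pos]
    return re.compile(
        _optional(table["prefixes"]) + "(?:" + word + ")" + _optional(table["suffixes"])
    )

print(bool(build_pattern("ktAb", "noun").fullmatch("wktAbhm")))   # affixes allowed
print(bool(build_pattern("ktAb", "string").fullmatch("wktAb")))   # no affixes allowed
```

This also shows why results need checking: the more affixes a label permits, the more unrelated words the expanded pattern can accidentally match.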
To get an idea of the usefulness of tools like this in the
understanding of collocations of words in particular corpora, go into
the Quran corpus and search for fwz or ywm and notice the words that
come after (you can choose 'sort by word after' to see it even better
with ywm).
To get an idea of the usefulness of it for finding grammatical
constructions and for advanced students, here is an example from a
recent advanced class I taught. We encountered the structure illa
wa- (as in maa min Siini illa wa yajmac bayna waZiifatay camal --from
the Hayat) and some of the students really thought it was odd. I
went to the corpus and, from the Hayat and Ahram, was immediately able
to come up with about 20 clear examples that not only made the
structure clear but also made it familiar, so the next time the
students encountered it they rather excitedly recognized it. A search
like this requires regular expressions. If you want to try it, type
the following in either the Ahram or Hayat site (and be willing to
wait a significant amount of time for a search like this):
[EA]lA\sw\w+
and choose 'adv' as the part of speech. You will get several 'false
hits' but also a plethora of examples of the structure in question.
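
To see what the expression above matches, here it is run with Python's re module against a few invented transliterated lines. The sample lines are made up for illustration and are not guaranteed to follow the site's transliteration exactly; the site's own regular expression engine may also differ from Python's in details.

```python
import re

# The expression from the post: E or A, then lA, one whitespace
# character, then a word beginning with w.
ILLA_WA = re.compile(r"[EA]lA\sw\w+")

# Invented sample lines, for illustration only.
samples = [
    "mA mn rjl ElA wyktb",    # illa + wa- + verb: the target structure
    "mA fy Albyt ElA wld",    # false hit: 'wld' simply begins with w
    "mA fy Albyt ElA rjl",    # no hit: next word does not begin with w
]

for line in samples:
    match = ILLA_WA.search(line)
    print(line, "->", match.group(0) if match else None)
```

The second line shows where the 'false hits' come from: the pattern cannot tell the conjunction wa- from a word that happens to start with w.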
To use this tool, go to:
arcorpus.byu.edu
and choose either: Arabic Treebank, Ahram 1999, Hayat 1997, or Quran.
If you are using transliteration, and just want to try a few things
out without having to really figure it out, try choosing noun, and
typing in ktAb (capitalization is important), or choose femnoun and
type wlAyQ or gnymQ, or choose verb and type in vhb or gAdr.
The server we are using balks at serving up very large html pages, so
when the results get too long, it appears to keep chugging away but
in fact never finishes delivering the page. This will hopefully be
fixed eventually, but not soon. If you are searching for something
with many thousands of examples, you need to limit how much comes
back at once by reducing the number of examples shown and by not
showing some of the tables. The table of forms at the end (which you
cannot limit) is sorted by frequency, so if the server can't finish
it, at least you know that you have the most frequent forms listed.
The Quran and Treebank corpora are quite small (the Treebank only for
now), so requests should be answered quickly. The Ahram and Hayat
corpora are relatively large, so you will be waiting about 10 seconds
(or more) for each search. If you want quick results from modern
text, you can use the Treebank to get a lot of information without
much waiting. The Ahram and Hayat corpora, however, do have an added
table at the beginning which divides the results by genre, which can
be helpful for some purposes (e.g. to know that a particular word is
used more in Sports pages, or in letters to the editor, than in news
items).
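
The summary tables mentioned above (forms sorted by frequency, hits divided by genre) and the idea of limiting how many examples are shown can be sketched like this. The hit records, genre names and function names are all invented for illustration; this is not how the site stores or computes its results.

```python
from collections import Counter
from itertools import islice

# HYPOTHETICAL hit records: (matched form, genre). Invented data.
hits = [
    ("ktAb", "News"),
    ("AlktAb", "News"),
    ("ktAb", "Sports"),
    ("ktAbh", "Letters"),
    ("ktAb", "News"),
]

def forms_by_frequency(hits):
    """Table of matched forms, most frequent first."""
    return Counter(form for form, _ in hits).most_common()

def hits_by_genre(hits):
    """Table of hit counts per genre, most frequent first."""
    return Counter(genre for _, genre in hits).most_common()

def first_n(hits, limit):
    """Keep only the first `limit` examples so the page stays small."""
    return list(islice(hits, limit))

print(forms_by_frequency(hits))
print(hits_by_genre(hits))
print(first_n(hits, 2))
```

Because the forms table is ordered by count, a page that is cut off partway still shows the most frequent forms first.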
The program should work on any browser that displays Arabic. If you
get junk instead of Arabic, set your browser's encoding to utf-8 (this
should happen automatically on most browsers). If you have a fairly
wide screen, each line of text should fit on one line; if not, there
will be wrapping in each section of the table. The fonts look better
on a Mac than a PC, but there doesn't seem to be anything I can do
about that at the moment. Until the site is better designed, you may
have to pull down the line dividing the upper and lower halves of the
screen so you can see the whole 'form'.
Anyway, if you are interested in such things, check out this site,
and let me know what you think.
Dil Parkinson
------------------------------------------------------------------------
End of Arabic-L: 21 Jun 2005