[An-lang] AN corpora

Andrew Koontz-Garboden andrewkg at csli.stanford.edu
Sun May 30 15:29:57 UTC 2004


Hi.  I'm not sure what kind of searches you want to do, but depending on
the level of sophistication you need, a good bit can be done simply with
Google, using site-specific searches.  E.g., I've used Google to search
the online Tongan newspaper Taimi Tonga.  Of course, there's not *tons* of
text there (so I don't take the absence of something to mean much, though
presence is certainly useful), but it's something.  The way you do this
from Google is as follows.

Pretend I want to find strings of kuo 'osi.  I might do the following:

"kuo 'osi" site:www.planet-tonga.com

The quotes indicate that you're looking for kuo right next to 'osi.  The
bit after site: tells the search engine to restrict the search to that
particular domain.  If you run the search without the quotes, the search
engine will look for kuo and 'osi anywhere on the page, rather than right
next to each other.  The above search gave 232 hits (or so Google claims;
I didn't actually look through them).  You can see them here:

http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=%22kuo+%27osi%22+site%3Awww.planet-tonga.com
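
If you want to run a lot of these, the URL above can be put together
mechanically.  Here's a rough Python sketch of how I'd do it; the function
name and its arguments are made up for illustration, and the hl/lr/ie
parameters are just copied from the URL above, so treat it as a starting
point rather than anything official.

```python
from urllib.parse import urlencode


def build_site_search_url(phrase, site, exact=True):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    # Wrap the phrase in double quotes to ask for the words right next to
    # each other; leave them off to match the words anywhere on the page.
    query_phrase = f'"{phrase}"' if exact else phrase
    query = f"{query_phrase} site:{site}"
    # Parameters copied from the example URL in this message.
    params = {"hl": "en", "lr": "", "ie": "UTF-8", "q": query}
    # urlencode percent-encodes the quotes, apostrophe and colon, and
    # turns spaces into '+'.
    return "http://www.google.com/search?" + urlencode(params)


print(build_site_search_url("kuo 'osi", "www.planet-tonga.com"))
```

For the kuo 'osi example this should print the same URL as the one above.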

This method should work for any language for which there's an online
newspaper or website with substantial content.  Of course, it only works
for strings of text; I'd certainly love to have a Tongan treebank, or a
POS-tagged corpus...

Andrew Koontz-Garboden



On Sat, 29 May 2004, Ross Clark (FOA DALSL) wrote:

> Someone asked me whether there are word frequency statistics available for
> Samoan, such as exist for English and other big languages. I think probably
> not, and further it occurred to me that such statistics depend on a corpus
> of the language in question -- nowadays assumed to be computer-searchable.
> Corpus linguistics seems to be pretty trendy in English right now. But I
> wonder whether there are comparable bodies of text for any Austronesian
> languages? At one time the Maori Studies people here had at least the
> beginnings of one, and I believe the Maori Newspapers project aims
> eventually to have a searchable online corpus. Any other news?
>
> Ross Clark
> _______________________________________________
> An-lang mailing list
> An-lang at anu.edu.au
> http://mailman.anu.edu.au/mailman/listinfo/an-lang
>