Some grammatical conundrums in Thai
Doug Cooper
doug at th.net
Sun May 28 12:07:09 UTC 2000
At 18:23 25/5/00 -0700, Jim Placzek wrote:
> My main curiosity was really with what can go after day3 in a
>statement. There seems to be a class of words that can go there,
>and it implies that these words may be beyond the "scope" of day3.
> That's not an official question, but if you'd care to comment it may be
>interesting for other Thai language scholars.
Not exactly sure what the question is. Nevertheless, I've put
some data that may help with the answer (or supply examples
for further questions) online at:
http://seasrc.th.net/corpus/day3.htm
http://seasrc.th.net/corpus/day3.zip (raw contexts, 221K)
The page shows output from one of the tools I've written for Thai
text analysis. It lists all distinct collocates of day3 (1,825 left, and
1,754 right) from the 10,643 instances found in a small (3.3 meg),
moderately balanced corpus: about 1 meg each of (least-formal)
message board postings, (mid-formal) Thai Rath columns, and
(most-formal) Thai News Agency broadcasts, with the rest assorted
literary text. The lists look something like this:
(1384) ไม่ได้ (445) ก็ได้ (443) จะได้ (303) ที่ได้ ... (1,821 more left
col's).
(833) ได้รับ (305) ได้มี (183) ได้ว่า (179) ... (1,750 more right col's).
I've included Haas POS entries for each collocate (but not the
whole phrase) when available. Highlighting, then clicking, any
phrase brings up all contexts (+/- 20 characters) of the collocate
pair, ordered by frequency of secondary collocates, eg:
86 (รับ) ่ทุกวันทั้งที่ยัง ไม่ได้ รับการตัดสินผมว่าเฉพ
... 85 more contexts for ไม่ได้ + รับ ...
76 (มี) 99ใ99เปอร์เซ็นโดย ไม่ได้ มีข้อมูลที่ชัดเจนแน่
... 75 more contexts for ไม่ได้ + มี ...
Some entries are counted more than once because I return
both leading and trailing heads _and_ compounds/phrases
(as seen in Haas). Thus, we get both /pay day3/ and
/pen pay day3/, or /day3 yang/ and /day3 yang-ray/.
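
To make the double counting concrete, here is a minimal sketch in Python.
It is not the tool described above: the lexicon, the romanised syllables,
and the function name right_collocates are all hypothetical stand-ins, and
it assumes the trailing context has already been split into syllables.

```python
from collections import Counter

# Hypothetical lexicon (romanised); a stand-in for real dictionary entries.
LEXICON = {"yang", "yang-ray", "rap"}


def right_collocates(syllables, lexicon, max_len=3):
    """Return every lexicon entry that starts right after the keyword.

    Both a head ("yang") and a longer compound ("yang-ray") are returned
    when both are in the lexicon, so one context can add to two counts.
    """
    found = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        candidate = "-".join(syllables[:n])
        if candidate in lexicon:
            found.append(candidate)
    return found


# Three made-up trailing contexts, already split into syllables.
counts = Counter()
for trail in (["yang", "ray"], ["yang", "may"], ["rap"]):
    counts.update(right_collocates(trail, LEXICON))
```

Three contexts give four counted collocates here, because the first context
contributes to both "yang" and "yang-ray".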
The raw contexts file has lines of the form:
lead_context<tab>day3<tab>trailing_context
where the contexts are roughly 20 characters each. <L>
indicates a line-break in the original.
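
For anyone who wants to read the raw file programmatically, a record in
that layout could be parsed roughly as follows. This is only a sketch: the
names Context, parse_context_line and read_contexts are mine, and the
TIS-620 default encoding is an assumption about the file, not something
stated above.

```python
from typing import NamedTuple

LINE_BREAK = "<L>"  # marker for a line break in the original text


class Context(NamedTuple):
    lead: str
    keyword: str
    trail: str


def parse_context_line(line: str) -> Context:
    """Split one 'lead<tab>keyword<tab>trail' record into its fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    # Restoring <L> to a real newline is a convenience choice of this
    # sketch; the file itself keeps the marker.
    lead, keyword, trail = (f.replace(LINE_BREAK, "\n") for f in fields)
    return Context(lead, keyword, trail)


def read_contexts(path: str, encoding: str = "tis-620"):
    """Yield a Context per non-empty line (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_context_line(line)
```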
In a good-sized sample (140 million chars), day3 gets
about 575,000 hits. It's about the fifth most frequent word in
least-formal text, and about the tenth -- half as common --
in most-formal text (such as news broadcasts relating to the
royal family). Distribution is classically Zipfian (rank × hits
stays roughly constant), with the corollary result that about
half the distinct collocates occur just once.
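
Both observations are easy to check against any frequency list. The sketch
below is illustrative only; the function names are mine, and the sample
input is just the four left-collocate counts quoted earlier in this
message, with placeholder labels instead of the Thai words.

```python
def rank_times_hits(counts):
    """Return (rank, hits, rank * hits) rows, most frequent item first."""
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, hits, rank * hits)
            for rank, hits in enumerate(ordered, start=1)]


def hapax_fraction(counts):
    """Fraction of distinct items that occur exactly once."""
    if not counts:
        return 0.0
    once = sum(1 for hits in counts.values() if hits == 1)
    return once / len(counts)


# The four most frequent left collocates listed above (labels are placeholders).
top_left = {"left-1": 1384, "left-2": 445, "left-3": 443, "left-4": 303}
rows = rank_times_hits(top_left)
# rows -> [(1, 1384, 1384), (2, 445, 890), (3, 443, 1329), (4, 303, 1212)]
```

Four items are far too few to show a Zipfian curve, but the products
already sit in the same general range rather than falling with the counts.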
Please let me know if this can be made more useful in
any way,
Doug
__________________________________________________
1425 VP Tower, 21/45 Soi Chawakun
Rangnam Road, Rajthevi, Bangkok, 10400
doug at th.net (662) 246-8946 fax (662) 246-8789
Southeast Asian Software Research Center, Bangkok
http://seasrc.th.net --> SEASRC Web site
http://seasrc.th.net/sealang --> SEALANG Web site