[Lexicog] word frequency
Jan F. Ullrich
jfu at CENTRUM.CZ
Tue Sep 13 20:14:07 UTC 2005
> [mailto:lexicographylist at yahoogroups.com] On Behalf Of
> fieldworks_support at sil.org
>
> 1) You could sort your word list by the numeric order of the
> frequency counts. Do this by changing the sorting of the
> wordlist database to use the \cz field (Count Sort). If you
> want a descending sort with the highest numbers first, make a
> language encoding where the digits are included in descending
> order, ie 9 8 7 6 5 4 3 2 1 0 and use that language encoding
> for your \cz field. The cz field will work better for
> sorting than the \c field, because the numbers in the \c
> field don't have leading zeroes, and Shoebox/Toolbox won't
> sort them correctly. It would sort '100' right next to '10'
> and right next to '1' because they both begin with a '1'.
This is pretty much the method I used for choosing the most frequent words
for the student dictionary. Then I added many of the less frequent but
important words by checking certain semantic domains to make the word
selection of the student dictionary balanced.
But what I am looking for now is a way to automate the process of marking
a certain number of the most frequent words. If I understand you correctly,
you suggest that I add the mark in front of the 4,000 most frequent
words manually. But that is what I would like to avoid if possible.
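
A minimal sketch of how such marking might be automated with a script
outside Toolbox. It assumes the word list and the dictionary are exported
as plain standard-format text, that the word-list record marker is \w
(an assumption; only \c and \cz are named above), and that word forms in
the word list match the \lx lemmas exactly. The function names and the "*"
symbol are hypothetical. The symbol goes into a new \lc line, following
the suggestion quoted below, so sorting on \lx is unaffected.

```python
def parse_wordlist(text, word_marker="\\w", count_marker="\\c"):
    """Read {word: count} from standard-format word-list text.

    The \\w marker is an assumed name for the word field.
    """
    counts = {}
    word = None
    for line in text.splitlines():
        marker, _, value = line.strip().partition(" ")
        value = value.strip()
        if marker == word_marker:
            word = value
        elif marker == count_marker and word is not None and value.isdigit():
            counts[word] = int(value)
    return counts


def top_n(counts, n):
    """Return the n most frequent words, comparing counts as integers."""
    # Integer comparison avoids the '1', '10', '100' problem of text sorting.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:n]}


def mark_lexicon(text, frequent, symbol="*",
                 lemma_marker="\\lx", cite_marker="\\lc"):
    """Add a marked \\lc line after each \\lx whose lemma is in `frequent`.

    Limitation: records that already have an \\lc field are not merged.
    """
    out = []
    for line in text.splitlines():
        out.append(line)
        marker, _, value = line.partition(" ")
        lemma = value.strip()
        if marker == lemma_marker and lemma in frequent:
            out.append(f"{cite_marker} {symbol}{lemma}")
    return "\n".join(out)
```

Calling mark_lexicon(dictionary_text, top_n(parse_wordlist(wordlist_text),
4000)) would then mark the 4,000 most frequent entries in one pass.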
> 2) Then from this list, you could see which words in the
> dictionary need to be marked as being among the most
> frequent. To insert a symbol ahead of the lemma, just add
> your symbol ahead of the lemma, and in the vernacular
> language sort order, place your symbol in the "ignore
> characters" list. This should keep it from changing how the
> lemmas are sorted. Or else, if you are using the \lc
> (lexical citation) field but sorting on the lx field, you
> could put the symbol in the \lc field and not in the lx, and
> this should also sort correctly.
Yes, I considered this option. But I thought it would not work, because in
the output the frequency symbol would be placed at the lemma position,
while it should sit slightly to the left of the paragraph, so that all the
lemmas stay at the same indentation (I hope I am using the proper English
terms here). But I am thinking that maybe I could assign the symbol a
special style and indent it a little more to the left.
A somewhat related thought:
The text corpus we have been using for the Lakota dictionary has texts from
various time periods, from early missionary sources in the second half of
the 19th century up to now. Each group of texts is marked by date and
author. So if I export the word list from an 1890s text collection and
another one from a 2005 text collection, I can compare the frequencies in
those periods. I wish that Toolbox had a tool which would automatically add
such information into the dictionary file - for instance, Toolbox would
compare the lemma list of the 1880 Bible with the dictionary file and add
"Bi1880" into the \so field of all the lemmas found in that word list. Then
it would do the same with a text collection from the 1970s and so on.
Having such data in the \so field would help in filtering words based on
their usage over time, in different styles and by different authors.
Our project's programmer is working towards the comparison of lemma lists
with the dictionary database using scripts outside Toolbox, but I thought I
would ask around whether anyone else has had any experience with something
similar.
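
As an illustration only, the kind of comparison described above could be
sketched like this. It is not our programmer's actual script. It assumes
the dictionary is plain standard-format text with \lx as the record marker,
and that each lemma list has already been reduced to a set of lemmas keyed
by a source label such as "Bi1880". The function name is hypothetical, and
placing the \so lines directly after \lx is an arbitrary choice.

```python
def tag_sources(lexicon_text, lemma_lists,
                lemma_marker="\\lx", source_marker="\\so"):
    """Add one \\so line per source label whose lemma list contains the lemma.

    lemma_lists maps a label (e.g. "Bi1880") to a set of lemmas.
    Limitation: existing \\so fields are left alone, so running the
    script twice would add the same labels twice.
    """
    out = []
    for line in lexicon_text.splitlines():
        out.append(line)
        marker, _, value = line.partition(" ")
        if marker != lemma_marker:
            continue
        lemma = value.strip()
        # Sorted labels keep the output stable from run to run.
        for label in sorted(lemma_lists):
            if lemma in lemma_lists[label]:
                out.append(f"{source_marker} {label}")
    return "\n".join(out)
```

Filtering on \so in Toolbox would then show, for example, only the entries
attested in the 1880 Bible.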
Jan F. Ullrich
Lakota Language Consortium
www.lakhota.org