<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Jan F. Ullrich wrote:
<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"
type="cite">
<pre wrap="">
</pre>
<blockquote type="cite">
<pre wrap="">[<a class="moz-txt-link-freetext" href="mailto:lexicographylist@yahoogroups.com">mailto:lexicographylist@yahoogroups.com</a>] On Behalf Of
<a class="moz-txt-link-abbreviated" href="mailto:fieldworks_support@sil.org">fieldworks_support@sil.org</a>
1) You could sort your word list by the numeric order of the
frequency counts. Do this by changing the sorting of the
wordlist database to use the \cz field (Count Sort). If you
want a descending sort with the highest numbers first, make a
language encoding where the digits are included in descending
order, ie 9 8 7 6 5 4 3 2 1 0 and use that language encoding
for your \cz field. The cz field will work better for
sorting than the \c field, because the numbers in the \c
field don't have leading zeroes, and Shoebox/Toolbox won't
sort them correctly. It would sort '100' right next to '10'
and right next to '1' because they both begin with a '1'.
</pre>
</blockquote>
<pre wrap=""><!---->
This is pretty much the method I used for choosing the most frequent words
for the student dictionary. Then I added many of the less frequent but
important words by checking certain semantic domains to make the word
selection of the student dictionary balanced.
But what I am searching for now is a way to automate the process of marking
a certain number of the most frequent words. If I understand you correctly
you suggest that I would add the mark in front of the 4,000 most frequent
words manually. But that is what I would like to avoid if possible.
</pre>
</blockquote>
The way I would do this is to export my data file into two separate
files: one containing all the records which are among the most
frequent and need to get that special symbol, and another containing
the records which do not. I assume that you have a means to do that,
such as a frequency field or other field which allows you to filter
your data on that criterion. (Make one filter to gather the first
set, then put a NOT in front
of the same filter definition to get the reverse effect.) Once you have
these two separate files you can open the file of frequent records in a text
editor (I use UltraEdit, but there are free ones which could also
do this) and do a search and replace on the \lx field to add the special
symbol there, so you end up with something like: <br>
\lx * blahblahblah<br>
Then you can copy your two files together again
and reopen the combined file in Toolbox. Toolbox will re-sort your file and you will
have achieved what you want.<br>
<br>
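If running a small script outside Toolbox is acceptable, the split, replace and merge steps above could be collapsed into one pass. What follows is only a rough Python sketch, not a Toolbox feature: it assumes every record starts with an \lx line and carries a plain integer count in a hypothetical \fq field (substitute whatever marker your database really uses), and the function names are my own invention.<br>
<pre>
```python
def split_records(sfm_text):
    # Everything before the first \lx line is treated as the file header.
    header, records = [], []
    for line in sfm_text.splitlines():
        if line.startswith("\\lx "):
            records.append([line])
        elif records:
            records[-1].append(line)
        else:
            header.append(line)
    return header, records


def field_value(record, marker):
    # Return the content of the first field with this marker, or None.
    prefix = marker + " "
    for line in record:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def mark_frequent(sfm_text, top_n, symbol="*", freq_field="\\fq"):
    # freq_field is an assumed marker name; records without a usable
    # count are never marked.
    header, records = split_records(sfm_text)

    def count(rec):
        value = field_value(rec, freq_field)
        return int(value) if value and value.isdigit() else 0

    ranked = sorted(range(len(records)),
                    key=lambda i: count(records[i]), reverse=True)
    chosen = [i for i in ranked if count(records[i])][:top_n]
    for i in chosen:
        lemma = records[i][0][len("\\lx "):]
        records[i][0] = "\\lx " + symbol + " " + lemma
    lines = header + [line for rec in records for line in rec]
    return "\n".join(lines) + "\n"
```
</pre>
Calling mark_frequent(text, 4000) on the contents of the dictionary file would return the same text with the symbol added to the 4,000 records with the highest counts. Work on a copy of the file, and check the result in Toolbox before replacing anything.<br>
<br>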
<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"
type="cite">
<pre wrap=""></pre>
<blockquote type="cite">
<pre wrap="">2) Then from this list, you could see which words in the
dictionary need to be marked as being among the most
frequent. To insert a symbol ahead of the lemma, just add
your symbol ahead of the lemma, and in the vernacular
language sort order, place your symbol in the "ignore
characters" list. This should keep it from changing how the
lemmas are sorted. Or else, if you are using the \lc
(lexical citation) field but sorting on the lx field, you
could put the symbol in the \lc field and not in the lx, and
this should also sort correctly.
</pre>
</blockquote>
<pre wrap=""><!---->
Yes, I considered this option. But I thought it would not work because in
the output the frequency symbol would be indented on the lemma position,
while it should be slightly left of the paragraph keeping all the lemmas on
the same indention position (I hope I am using the proper English terms
here). But I am thinking that maybe I could assign the symbol a special
style and indent it a little more to the left.
</pre>
</blockquote>
This is what I would experiment with as well. You might have to make a
copy of the base paragraph style which each dictionary entry is based
on and then change the indentation to accommodate that symbol. Then you
should be able to do a search and replace in Word to find each record
which contains your (unique!) symbol and apply the new paragraph style
to that entry. It won't work to apply only a character style to that
symbol, because indentation is a paragraph property and can't be set in a character style.<br>
<br>
<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"
type="cite">
<pre wrap="">
A somewhat related thought:
The text corpus we have been using for Lakota dictionary has texts from
various time periods from early missionary sources in second half of 19.
century up to now. Each group of texts is marked by date and author. So if I
export the word-list from a 1890s text collection and another one from 2005
text collection, I can compare the frequency in those periods. I wish that
Toolbox had a tool which would automatically add such information into the
dictionary file - for instance Toolbox would compare the lemma list of the
1880 Bible with the dictionary file and add "Bi1880" into the \so field of
all the lemmas found in that word list. Then it would do the same with a
text collection from 1970s and so on. Having such data in the \so field
would help filtering words based on their usage in time, in different styles
and by different authors.
Our project's programmer is working towards the comparison of lemma lists
with the dictionary database using scripts outside Toolbox, but I thought I
would ask around if anyone else had some experiences with something similar.</pre>
</blockquote>
A modified version of the cc table I mentioned in reply to your last post would
accomplish this task as well. I haven't written it yet, but it could be
done. Contact me off-list if you would like such a cc table written. It
should be a fairly straightforward process to gather the data from one
source field and add it to the \so field in another file. You would be
able to run this cc table on as many wordlists as you need or want,
each time adding the source info to your master dictionary file.<br>
<br>
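In case it helps your programmer, here is the same idea as a rough Python sketch. This is not the cc table itself, just an illustration of the logic. It assumes records begin with \lx, that the lemmas from a wordlist have already been read into a Python list, and that a record may carry several \so lines; the function name is hypothetical.<br>
<pre>
```python
def add_source_tag(sfm_text, lemmas, tag, source_field="\\so"):
    # Add a line such as "\so Bi1880" to every record whose \lx lemma
    # occurs in lemmas, unless the record already has that exact line.
    wanted = set(lemmas)
    out = []
    record = []

    def flush():
        # Finish the record collected so far and copy it to the output.
        if not record:
            return
        first = record[0]
        if first.startswith("\\lx ") and first[4:].strip() in wanted:
            new_line = source_field + " " + tag
            if not any(line.strip() == new_line for line in record):
                # Keep blank separator lines at the end of the record.
                end = len(record)
                while end and not record[end - 1].strip():
                    end -= 1
                record.insert(end, new_line)
        out.extend(record)
        record.clear()

    for line in sfm_text.splitlines():
        if line.startswith("\\lx "):
            flush()
        record.append(line)
    flush()
    return "\n".join(out) + "\n"
```
</pre>
Running it once per wordlist, for example with the tag "Bi1880" and then again with a tag for the 1970s collection, adds one \so line per source to each matching record. Matching here is on the exact lemma string, so any difference in spelling or capitalisation between the wordlist and the dictionary would need to be normalised first.<br>
<br>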
Norbert Rennert<br>
Canada Institute of Linguistics<br>
</body>
</html>