<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Jan F. Ullrich wrote:

<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"

 type="cite">


  <blockquote type="cite">

    <pre wrap="">[<a class="moz-txt-link-freetext" href="mailto:lexicographylist@yahoogroups.com">mailto:lexicographylist@yahoogroups.com</a>] On Behalf Of 

<a class="moz-txt-link-abbreviated" href="mailto:fieldworks_support@sil.org">fieldworks_support@sil.org</a>

1) You could sort your word list by the numeric order of the 

frequency counts. Do this by changing the sorting of the 

wordlist database to use the \cz field (Count Sort). If you 

want a descending sort with the highest numbers first, make a 

language encoding where the digits are included in descending 

order, i.e. 9 8 7 6 5 4 3 2 1 0, and use that language encoding 

for your \cz field.  The \cz field will work better for 

sorting than the \c field, because the numbers in the \c 

field don't have leading zeroes, and Shoebox/Toolbox won't 

sort them correctly. It would sort '100' right next to '10' 

and right next to '1' because they all begin with a '1'.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

This is pretty much the method I used for choosing the most frequent words

for the student dictionary. Then I added many of the less frequent but

important words by checking certain semantic domains to make the word

selection of the student dictionary balanced.

But what I am searching for now is a way to automate the process of marking

a certain number of the most frequent words. If I understand you correctly

you suggest that I would add the mark in front of the 4,000 most frequent

words manually. But that is what I would like to avoid if possible.

  </pre>

</blockquote>

The way I would do this is to export the data file into 2 separate

files: one containing all the records that are most frequent and need

the special symbol, and the other containing the records that do not.

I assume you have a means to do that, such as a frequency field or some

other field that lets you filter your data on that criterion. (Make one

filter that selects the first set, then put a NOT in front of the same

filter definition to select the rest.) Once you have these 2 files,

open the first one in a text editor (I use UltraEdit, but there are

free ones that could also do this) and do a search and replace on the

\lx field to add the special symbol there, so you end up with something

like:<br>

\lx * blahblahblah<br>

Then join the two files into one again and reopen it in Toolbox.

Toolbox will re-sort the file and you will have achieved what you

want.<br>

<br>
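For anyone who would rather script this than split the file and run a manual search and replace, here is a rough Python sketch of the same marking step. It is only an illustration under assumptions: the function names are made up, the data is assumed to be exported as Standard Format text with each record starting at \lx, and the count is assumed to sit in a hypothetical \fq field (substitute whatever field you actually filter on). Counts are compared as integers, so 100 ranks above 10 and 9.<br>
<pre wrap="">
```python
def split_records(sfm_text):
    """Split Standard Format text into records, each starting at a \\lx line."""
    records = []
    for line in sfm_text.splitlines():
        # Lines before the first \lx (e.g. a file header) form their own group
        if line.startswith("\\lx ") or not records:
            records.append([])
        records[-1].append(line)
    return records


def field_value(record, marker):
    """Return the text of the first field with this marker, or None."""
    prefix = "\\" + marker + " "
    for line in record:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def mark_most_frequent(sfm_text, top_n, symbol="*", count_marker="fq"):
    """Put symbol in front of the lexeme of the top_n records by count.

    count_marker is an assumed field name, not a Toolbox default.
    """
    records = split_records(sfm_text)
    counted = []
    for index, record in enumerate(records):
        value = field_value(record, count_marker)
        has_lexeme = field_value(record, "lx") is not None
        if has_lexeme and value is not None and value.isdigit():
            counted.append((int(value), index))
    # Highest count first; ties keep their original file order
    counted.sort(key=lambda pair: (-pair[0], pair[1]))
    chosen = {index for _, index in counted[:top_n]}

    out = []
    for index, record in enumerate(records):
        for line in record:
            if index in chosen and line.startswith("\\lx "):
                line = "\\lx " + symbol + " " + line[4:]
            out.append(line)
    return "\n".join(out) + "\n"
```
</pre>
Calling mark_most_frequent(text, 4000) on the exported text would return it with the symbol added to the 4,000 records with the highest counts. Running it twice would add the symbol twice, so work on a copy of the file.<br>
<br>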

<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"

 type="cite">


  <blockquote type="cite">

    <pre wrap="">2) Then from this list, you could see which words in the 

dictionary need to be marked as being among the most 

frequent.  To insert a symbol ahead of the lemma, just add 

your symbol ahead of the lemma, and in the vernacular 

language sort order, place your symbol in the "ignore 

characters" list. This should keep it from changing how the 

lemmas are sorted.  Or else, if you are using the \lc 

(lexical citation) field but sorting on the \lx field, you 

could put the symbol in the \lc field and not in the \lx, and 

this should also sort correctly.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Yes, I considered this option. But I thought it would not work because in

the output the frequency symbol would be indented on the lemma position,

while it should be slightly left of the paragraph keeping all the lemmas on

the same indentation position (I hope I am using the proper English terms

here).  But I am thinking that maybe I could assign the symbol a special

style and indent it a little more to the left.

  </pre>

</blockquote>

This is what I would experiment with as well. You might have to make a

copy of the base paragraph style that each dictionary entry is based on

and change its indentation to accommodate the symbol. Then you should

be able to do a search and replace in Word to find each entry that

contains your (unique!) symbol and apply the new paragraph style to it.

Applying only a character style to the symbol won't work, because

character styles cannot carry indentation settings.<br>

<br>

<blockquote cite="mid001101c5b89f$b20881e0$0201a8c0@ullrichnet"

 type="cite">

  <pre wrap="">

A somewhat related thought:

The text corpus we have been using for the Lakota dictionary has texts from

various time periods, from early missionary sources in the second half of the

19th century up to now. Each group of texts is marked by date and author. So if I

export the word-list from an 1890s text collection and another one from a 2005

text collection, I can compare the frequency in those periods. I wish that

Toolbox had a tool which would automatically add such information into the

dictionary file - for instance Toolbox would compare the lemma list of the

1880 Bible with the dictionary file and add "Bi1880" into the \so field of

all the lemmas found in that word list. Then it would do the same with a

text collection from the 1970s and so on. Having such data in the \so field

would help filtering words based on their usage in time, in different styles

and by different authors. 

Our project's programmer is working towards the comparison of lemma lists

with the dictionary database using scripts outside Toolbox, but I thought I

would ask around if anyone else has had some experience with something similar.</pre>

</blockquote>

A modified version of the cc table I mentioned in response to your last

post would accomplish this task as well. I haven't written it yet, but

it could be done. Contact me off-list if you would like such a cc table

written. It should be a fairly straightforward process to gather the

data from one source field and add it to the \so field in another file.

You would be able to run this cc table on as many wordlists as you need

or want, each time adding the source info to your master dictionary

file.<br>

<br>
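To show the logic such a table would need, here is a small Python sketch of the same idea (not the cc table itself, which is still unwritten). The function names are hypothetical, and it assumes that the dictionary is Standard Format text with records starting at \lx, that the lemma list has already been read into a plain list of strings, and that several source tags in one \so field are separated by spaces.<br>
<pre wrap="">
```python
def tag_record(record, source_tag):
    """Return a copy of one record with source_tag present in its \\so field."""
    lines = list(record)
    for i, line in enumerate(lines):
        if line.startswith("\\so ") or line == "\\so":
            present = line[3:].split()
            if source_tag not in present:
                lines[i] = " ".join(["\\so"] + present + [source_tag])
            return lines
    # No \so field yet: add one after the last non-blank line of the record
    insert_at = len(lines)
    while insert_at != 0 and lines[insert_at - 1].strip() == "":
        insert_at -= 1
    lines.insert(insert_at, "\\so " + source_tag)
    return lines


def add_source(dictionary_text, lemmas, source_tag):
    """Add source_tag to every record whose \\lx value is in lemmas."""
    wanted = set(lemmas)
    records = []
    for line in dictionary_text.splitlines():
        # Lines before the first \lx (e.g. a file header) form their own group
        if line.startswith("\\lx ") or not records:
            records.append([])
        records[-1].append(line)

    out = []
    for record in records:
        head = record[0]
        if head.startswith("\\lx ") and head[4:].strip() in wanted:
            record = tag_record(record, source_tag)
        out.extend(record)
    return "\n".join(out) + "\n"
```
</pre>
Running add_source once per lemma list, for example with "Bi1880" for the 1880 Bible list and another tag for the 1970s collection, would build up the \so field one source at a time. A tag that is already present is not added again, so the same list can be rerun safely.<br>
<br>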

Norbert Rennert<br>

Canada Institute of Linguistics<br>


</body>

</html>