[Lexicog] Digital Glossarization

kawaguy32 KawaGeo at CHARTER.NET
Sat May 10 20:29:42 UTC 2008


There are two subtopics to discuss. The first is covered here; the
second will follow in the next message.

First discussion: the word parsing/plucking tool.

Since I am not a lexicographer, I have had to devise my own parsing
technique, described below:

1. Parse all words from a book or a chapter. A parsed word is defined
as a word between whitespace characters (spaces, tabs, line breaks),
with non-letters trimmed from both ends. Possessive "'s" suffixes are
also removed.

2. Test each word against a list of common words (the, are, with,
etc.). If it is not in the common list, save it in one of two lists:
one for capitalized words and one for the rest. Each word of the
first list is put into an "unplucked" list, since most capitalized
words are proper names that need not be looked up in the dictionary.

3. For each word in the latter list, search the web for the word's
definition(s); I call this the "plucking" method. If a definition is
found, save the webpage on the local system for further processing.
Before saving, the document needs to be "cleaned up" to make glossary
making easier. If no definition is found, put the word in the
unplucked list as well for human intervention.
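For anyone curious, the first two steps above can be sketched in
Python. This is only a minimal illustration of the technique; the
small common-word set and the sample sentence are stand-ins, not the
actual 8,000-word list.

```python
import re

# Stand-in for the real common-word list (about 8,000 words in practice)
COMMON = {"the", "are", "with", "and", "a", "of"}

def parse_words(text):
    """Step 1: split on whitespace, trim non-letters at both ends,
    and drop possessive "'s" suffixes."""
    words = []
    for token in text.split():
        token = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", token)
        if token.endswith("'s"):
            token = token[:-2]
        if token:
            words.append(token)
    return words

def classify(words):
    """Step 2: skip common words; split the rest into a capitalized
    list (mostly proper names, destined for the unplucked list) and a
    list of words to pluck."""
    capitalized, rest = [], []
    for w in words:
        if w.lower() in COMMON:
            continue
        (capitalized if w[0].isupper() else rest).append(w)
    return capitalized, rest

cap, rest = classify(parse_words('Holmes examined the "engineer\'s" thumb.'))
# cap  -> ["Holmes"]
# rest -> ["examined", "engineer", "thumb"]
```

Step 3 (plucking) would then fetch a dictionary page for each word in
the second list and save it locally.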

There are some special interventions during the plucking process. The
user may notice words in the word list that should not go through the
plucking process. Words that are not English, or that look odd,
should be removed from the list. Moreover, if one sees a familiar
word that does not need to be looked up, it should be moved into the
common list.

That's all the first tool does before the next stage.

There is a problem: where do we get a list of common words to begin
with? I have not found a satisfactory list on the web; maybe I just
overlooked one. The only way for me was to build a list with the aid
of a computer. What I did was count all the words in a collection of
books by one author, cut off all words with low counts, and weed out
any unknown words by hand. That gave me the list of common words. In
my case, I used Sir Arthur Conan Doyle's collection. (I like to read
detective stories.) The common list is about 8,000 words long,
including affixed words. The list will grow later during the plucking
work. My estimate is that about 500 words will remain to look up
after the plucking process is completed. By the way, I developed a
small program that counts words in a given text. It also accumulates
the counts across successive texts, i.e., book by book.
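The counting idea, accumulating totals book by book and then cutting
off low-frequency words, might look something like this sketch. The
cutoff value and the sample texts here are made up for illustration;
the real cutoff would be tuned by hand.

```python
import re
from collections import Counter

counts = Counter()  # running totals, accumulated across texts

def count_words(text, counts):
    """Add each word (lowercased, trimmed of non-letters) to the
    running frequency totals."""
    for token in text.split():
        word = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", token).lower()
        if word:
            counts[word] += 1

# Accumulate counts text by text, as with books in a collection
count_words("The dog barked. The dog ran.", counts)
count_words("The cat sat.", counts)

# Keep only words above a frequency cutoff as candidate common words;
# rare and unknown words are weeded out by hand afterwards
CUTOFF = 2
common_candidates = {w for w, n in counts.items() if n >= CUTOFF}
# common_candidates -> {"the", "dog"}
```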

Another problem: which dictionary on the net should I choose for
lookups? I found one at www.edict.com.hk, and I really like it. It
has many entries for affixed words, and each webpage is much easier
to "clean up" for glossary making.

The words in the unplucked list will be dealt with later.

Once the saved webpages are ready, the glossary making will commence.
The discussion will follow in the next message.

Program completion: 90%

To do: clean up the webpages further, to a satisfactory level.

If anyone is interested, I'll be happy to upload a screenshot or two
of the software tool.

Thank you.

Geo Massar


