Analyze word and phrase frequency
Herb Stahlke
hfwstahlke at GMAIL.COM
Sat Apr 4 17:23:44 UTC 2009
Tom,
Have you tested your definitions for accuracy? I understand that a
computer count won't get everything right and so you have to program
in certain common strings that should be omitted from the count. But
excluding single letters followed by a period? You would lose cases
like
She ran as fast as I.
Since you'd be counting all sorts of unprepared text, treating hyphens
as spaces would count a prefix like "non-" as a word. Counting all
two-word compounds as two words ignores the morphology of compounding,
which is not reflected accurately in our orthography.
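
To make the losses concrete, here is a minimal sketch of rules 1, 3, 5,
and 7 from the quoted list below. It is an assumption about how such a
counter might be written, not the behavior of Primitive Word Counter;
the function name compwords and its two switches are hypothetical, and
acronym handling (part of rule 3) is left out.

```python
import re
from collections import Counter

# Hypothetical implementation of the quoted "compword" rules; the name,
# the switches, and the regular expression are illustrative assumptions.
def compwords(text, hyphen_as_space=True, drop_initials=True):
    if hyphen_as_space:
        text = text.replace("-", " ")        # rule 5: hyphens count as spaces
    words = []
    for token in text.split():               # rule 1: strings bordered by spaces
        if drop_initials and re.fullmatch(r"[A-Za-z]\.", token):
            continue                         # rule 7: single letter + period
        core = token.strip(".,;:!?\"'()")    # peel surrounding punctuation
        if core.isalpha():                   # rule 3: letters only, no digits
            words.append(core.lower())
    return words

sample = "She ran as fast as I. It was a non-trivial race."
strict = Counter(compwords(sample))
loose = Counter(compwords(sample, drop_initials=False))

print(loose - strict)    # words that rule 7 removes from the count
print(strict["non"])     # how often the bare prefix "non" is counted
```

Comparing the two counters shows the sentence-final pronoun "I"
disappearing under rule 7, while "non" turns up as a word of its own
under rule 5.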
Ignoring misspellings means running spellcheck on your results, but
the problem with spellcheck is that it only excludes words that don't
match anything in its dictionary. I spent a lot of time in the 80s
working in computer assisted instruction, and one of the projects I
devoted time to was developing a probabilistic spelling checker, a
program that could look at a word that doesn't match the dictionary and
judge, by using letter frequency by position by length of word,
whether a misspelling is an otherwise correct answer. Most CAI simply
rejected all answers that weren't an exact match, which isn't very
useful in the language arts.
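
The idea of letter frequency by position by length of word can be
sketched roughly as follows. This is not a reconstruction of the 1980s
program; the scoring formula (mean relative frequency of each letter at
its position, among dictionary words of the same length), the function
names, and the toy lexicon are all assumptions for illustration.

```python
from collections import defaultdict

def build_table(lexicon):
    """Count each letter at each position, separately per word length."""
    counts = defaultdict(lambda: defaultdict(int))  # (length, pos) -> letter -> n
    totals = defaultdict(int)                       # length -> number of words
    for word in lexicon:
        w = word.lower()
        totals[len(w)] += 1
        for pos, letter in enumerate(w):
            counts[(len(w), pos)][letter] += 1
    return counts, totals

def plausibility(word, counts, totals):
    """Mean relative frequency of the word's letters at their positions."""
    w = word.lower()
    n = totals.get(len(w), 0)
    if n == 0 or not w:
        return 0.0                                  # no evidence for this length
    hits = sum(counts[(len(w), i)][c] for i, c in enumerate(w))
    return hits / (n * len(w))

# Toy lexicon (assumed); a real checker would use a full dictionary.
counts, totals = build_table(["their", "there", "three", "where", "these"])

near_miss = plausibility("thier", counts, totals)
noise = plausibility("xqzvk", counts, totals)
print(near_miss > noise)
```

A string that scores well but is not in the dictionary would be a
candidate misspelling of a real answer rather than noise; where to set
the cutoff is a tuning decision this sketch does not address.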
So what do you lose by your rules and what does this do to the
accuracy of your word counts?
Herb
On Sat, Apr 4, 2009 at 11:13 AM, Tom Zurinskas <truespel at hotmail.com> wrote:
> ---------------------- Information from the mail header -----------------------
> Sender: American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> Poster: Tom Zurinskas <truespel at HOTMAIL.COM>
> Subject: Re: Analyze word and phrase frequency
> -------------------------------------------------------------------------------
>
> The word counter is good. It results in what looks like a spreadsheet, which is good, but I need to be able to copy/paste it into a regular spreadsheet. Problem - Only one line can be selected at a time. Not good. Anyone else have that problem?
>
> http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency
>
> To download it click on the blue word "link" at middle right.
>
> Computers do a great job of counting words. So we need to define what computer-counted words ("compwords") are.
>
> 1. A letter string bordered by spaces.
> 2. Intelligible (no typos).
> 3. Does not include numbers, punctuation, acronyms.
> 4. Two-word words (like tidal wave) are two words.
> 5. Hyphens count as spaces so hyphenated words are two words.
> 6. Reattach hyphenated words at end of line (or ignore).
> 7. Single letters followed by periods are not words.
> any more?
>
> Tom Zurinskas, USA - CT20, TN3, NJ33, FL5+
> see truespel.com
>
>
>
>
> ----------------------------------------
>> Date: Wed, 1 Apr 2009 21:45:21 -0400
>> From: jharbeck at SYMPATICO.CA
>> Subject: Fwd: Analyze word and phrase frequency
>> To: ADS-L at LISTSERV.UGA.EDU
>>
>> ---------------------- Information from the mail header -----------------------
>> Sender: American Dialect Society
>> Poster: James Harbeck
>> Subject: Fwd: Analyze word and phrase frequency
>> -------------------------------------------------------------------------------
>>
>> This looks like it could be useful for some kinds of analysis.
>>
>> -----Original Message-----
>>
>> http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency
>>
>> You can check the number of words in just about any word processing
>> program, but what about the distribution of those words?
>>
>> Primitive Word Counter analyzes text from your clipboard or file and
>> returns the frequency of words and phrases in the text. You can set a
>> minimum word length and have it ignore numbers to trim down the
>> volume of replies it returns.
>>
>> ------------------------------------------------------------
>> The American Dialect Society - http://www.americandialect.org
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org