Analyze word and phrase frequency

Tom Zurinskas truespel at HOTMAIL.COM
Sun Apr 5 18:14:21 UTC 2009


It's not my decision, but the computer's.  It doesn't know that "tidal wave" is one word, so a computer-generated word count has some constraints.  By definition, a computer-counted word can't be a hyphenated or two-word word.  It's just something to be cognizant of.
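
To make the constraint concrete, here is a minimal sketch in Python. It is not how Primitive Word Counter works internally (the thread does not say); the function names and the sample sentence are made up for illustration. It counts single words and adjacent word pairs, so a frequent pair such as "tidal wave" can at least be spotted during manual culling even though the counter itself treats it as two words.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercase, treat hyphens as spaces, and keep only runs of letters a-z.
    return re.findall(r"[a-z]+", text.lower().replace("-", " "))

def count_words_and_pairs(text):
    words = tokens(text)
    word_counts = Counter(words)
    # Adjacent pairs of tokens. Sentence boundaries are ignored here,
    # so some pairs will straddle a period.
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts

# Hypothetical sample text, not taken from any real corpus.
sample = "A tidal wave hit the coast. The tidal wave was a well-known risk."
word_counts, pair_counts = count_words_and_pairs(sample)

print(word_counts["tidal"], word_counts["wave"])   # each half is counted separately
print(pair_counts[("tidal", "wave")])              # the pair shows up only as a phrase
print("well-known" in word_counts)                 # hyphenated form is never a key
```

The pair counts do not decide what is a compound; they only give a person something to review.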



Tom Zurinskas, USA - CT20, TN3, NJ33, FL5+
see truespel.com

----------------------------------------
> Date: Sat, 4 Apr 2009 19:34:45 -0400
> From: hfwstahlke at GMAIL.COM
> Subject: Re: Analyze word and phrase frequency
> To: ADS-L at LISTSERV.UGA.EDU
>
> ---------------------- Information from the mail header -----------------------
> Sender: American Dialect Society
> Poster: Herb Stahlke
> Subject: Re: Analyze word and phrase frequency
> -------------------------------------------------------------------------------
>
> But why did you decide that there are no compound words in which the
> words are separated orthographically by spaces? That orthographic
> convention is a cultural convention and not clearly grounded in
> linguistic structure.
>
> Herb
>
> On Sat, Apr 4, 2009 at 5:56 PM, Tom Zurinskas wrote:
>>
>> Point is that word lists from counters need manual culling. One obvious outcome is that there are no two-word words (like "tidal wave").
>>
>> If typos could be figured out and retyped, that would be ideal for the word count.
>>
>> These decisions show what I went through in my culling process for the 5000 word list used in book 4.
>>
>>
>> Tom Zurinskas, USA - CT20, TN3, NJ33, FL5+
>> see truespel.com
>>
>>> ---------------------- Information from the mail header -----------------------
>>> Sender: American Dialect Society
>>> Poster: Herb Stahlke
>>> Subject: Re: Analyze word and phrase frequency
>>> -------------------------------------------------------------------------------
>>>
>>> Tom,
>>>
>>> Have you tested your definitions for accuracy? I understand that a
>>> computer count won't get everything right and so you have to program
>>> in certain common strings that should be omitted from the count. But
>>> excluding single letters followed by a period? You would lose cases
>>> like
>>>
>>> She ran as fast as I.
>>>
>>> Since you'd be counting all sorts of unprepared text, the treatment of
>>> hyphenated words would treat "non-" like a word. Counting all
>>> two-word compounds as two words ignores the morphology of compounding,
>>> which is not reflected accurately in our orthography.
>>>
>>> Ignoring misspellings means running spellcheck on your results, but
>>> the problem with spellcheck is that it only excludes words that don't
>>> match anything in its dictionary. I spent a lot of time in the 80s
>>> working in computer assisted instruction, and one of the projects I
>>> devoted time to was developing a probabilistic spelling checker, a
>>> program that could look at a word that doesn't match the dictionary and
>>> judge, by using letter frequency by position by length of word,
>>> whether a misspelling is an otherwise correct answer. Most CAI simply
>>> rejected all answers that weren't an exact match, which isn't very
>>> useful in the language arts.
>>>
>>> So what do you lose by your rules and what does this do to the
>>> accuracy of your word counts?
>>>
>>> Herb
>>>
>>> On Sat, Apr 4, 2009 at 11:13 AM, Tom Zurinskas wrote:
>>>>
>>>> The word counter is good. Its output looks like a spreadsheet, which is good, but I need to be able to copy and paste it into a regular spreadsheet. The problem: only one line can be selected at a time. Anyone else have that problem?
>>>>
>>>> http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency
>>>>
>>>> To download it click on the blue word "link" at middle right.
>>>>
>>>> Computers do a great job at counting words. So we need to define what computer-counted words ("compwords") are.
>>>>
>>>> 1. A letter string bordered by spaces.
>>>> 2. Intelligible (no typos).
>>>> 3. Does not include numbers, punctuation, acronyms.
>>>> 4. Two-word words (like tidal wave) are two words.
>>>> 5. Hyphens count as spaces so hyphenated words are two words.
>>>> 6. Reattach hyphenated words at end of line (or ignore).
>>>> 7. Single letters followed by periods are not words.
>>>> any more?
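
The seven rules above can be read as a small filtering procedure. The following is a rough sketch in Python under stated assumptions, not the method actually used for the word list mentioned in this thread. The function name `compwords`, the optional `known_words` set standing in for "no typos", and the guess that an all-caps string of two or more letters is an acronym are all assumptions made for illustration.

```python
import re
from collections import Counter

def compwords(text, known_words=None):
    # Rule 6: rejoin a word hyphenated across a line break ("fre-\nquency").
    # Caveat: a true hyphenated word broken at that spot is fused as well.
    text = re.sub(r"([A-Za-z])-[ \t]*\n[ \t]*([A-Za-z])", r"\1\2", text)
    # Rule 5: remaining hyphens count as spaces.
    text = text.replace("-", " ")
    words = []
    for chunk in text.split():  # Rules 1 and 4: space-delimited strings
        # Rule 7: a single letter followed by a period is not a word.
        if re.fullmatch(r"[A-Za-z]\.", chunk):
            continue
        # Rule 3: strip surrounding punctuation, then drop anything
        # that still contains digits or other non-letters.
        core = chunk.strip(".,;:!?\"'()[]")
        if not core.isalpha():
            continue
        # Rule 3 (assumption): all-caps strings of 2+ letters are acronyms.
        if len(core) > 1 and core.isupper():
            continue
        word = core.lower()
        # Rule 2 (assumption): "intelligible" means found in a supplied word list.
        if known_words is not None and word not in known_words:
            continue
        words.append(word)
    return Counter(words)

# Hypothetical sample text.
text = "She ran as fast as I. NASA saw 3 non-stop tidal waves; the fre-\nquency rose."
counts = compwords(text)
print(counts["as"], counts["non"], counts["frequency"])
print("i" in counts, "nasa" in counts)
```

Run on the sample, the sketch drops the pronoun in "as fast as I." (rule 7) and counts "non" as a word (rule 5). It also silently drops contractions such as "don't", a case the rule list does not address.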
>>>>
>>>> Tom Zurinskas, USA - CT20, TN3, NJ33, FL5+
>>>> see truespel.com
>>>>
>>>> ----------------------------------------
>>>>> Date: Wed, 1 Apr 2009 21:45:21 -0400
>>>>> From: jharbeck at SYMPATICO.CA
>>>>> Subject: Fwd: Analyze word and phrase frequency
>>>>> To: ADS-L at LISTSERV.UGA.EDU
>>>>>
>>>>> ---------------------- Information from the mail header -----------------------
>>>>> Sender: American Dialect Society
>>>>> Poster: James Harbeck
>>>>> Subject: Fwd: Analyze word and phrase frequency
>>>>> -------------------------------------------------------------------------------
>>>>>
>>>>> This looks like it could be useful for some kinds of analysis.
>>>>>
>>>>> -----Original Message-----
>>>>>
>>>>> http://lifehacker.com/5190716/primitive-word-counter-analyzes-word-and-phrase-frequency
>>>>>
>>>>> You can check the number of words in just about any word processing
>>>>> program, but what about the distribution of those words?
>>>>>
>>>>> Primitive Word Counter analyzes text from your clipboard or file and
>>>>> returns the frequency of words and phrases in the text. You can set a
>>>>> minimum word length and have it ignore numbers to trim down the
>>>>> volume of replies it returns.
>>>>>
>>>>> ------------------------------------------------------------
>>>>> The American Dialect Society - http://www.americandialect.org
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


