Dialect prediction based on demographics

Fri May 20 22:25:41 UTC 2005

Hi.  I'm new to this list, and my main interest in American dialects is
based on a little research project that's been on my back burner for a
year now.  (Mainly because my boss doesn't think it's as cool as I do.)

This is probably quite different from the sorts of things most of you
think about in terms of dialects, but I hope you find it interesting.

The question is this:  what is the most accurate way to predict a
person's dialect using only simple demographic information, such as
place of origin and current address?

I recognize that this is going to be sloppy.  If I can get most people
clustered together properly most of the time, I'm doing well.  In fact,
I'd probably do fine with state-level granularity and data from
www.popvssoda.com.  But, of course, I'd like to do better than that.

A little background:

My background is in computer science.  I'm the programmer for a small
company named Vocal Laboratories (www.vocalabs.com).  I've previously
worked on behavior-predicting AIs (e.g. http://www.grouplens.org/) and
in some sense this little research project is another application of
predictive algorithms.

VocaLabs does "mystery shopper"-type usability tests of call centers.  A
lot of our business is testing speech recognition systems.  Our clients
like to know how well their computer can understand its callers.

Right now, the way that we do this sort of study is:
     (1) We get 500 or so people to call a VUI (Voice User
         Interface)-based system.  We already have demographic
         information on these people, and they fill out a
         questionnaire as part of the study.
     (2) We generate a report and deliver it to our client.
         Specifically, a computer tabulates the report.  We can't
         afford to listen to each call recording individually.  See
         http://www.vocalabs.com/services/samples.html for examples.
     (3) The client reads the report, looking most closely at the
         people who report having had the most difficulty.

I think it would be cool to have our computers do more analysis, in
particular to have them notice when people with similar dialects report
having similar problems.  That is, if there's a significant correlation
between dialect region and dissatisfaction, the report would say so on
its front page.  I don't know of any software that could take our call
recordings and determine the dialect-- and even if I did, I probably
couldn't afford it-- so I'm interested in using the demographic
information I do have.
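
One way to check for a "significant correlation between dialect region and dissatisfaction" is a chi-square test of independence on a table of region versus reported outcome. The following is a minimal sketch under stated assumptions, not a description of any existing VocaLabs code: the function names, the region labels, and the two-outcome table layout are all hypothetical, and the test assumes every region has at least one caller and that both outcomes occur somewhere in the data.

```python
# Standard chi-square critical values at the 0.05 level, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a region-by-outcome table.

    `table` maps a region label to a pair (n_dissatisfied, n_satisfied).
    The layout and labels are illustrative assumptions.
    """
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for j in range(2):
            # Count expected in this cell if region and outcome were unrelated.
            expected = row_total * col_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected

    degrees_of_freedom = (len(rows) - 1) * (2 - 1)
    return statistic, degrees_of_freedom


def region_effect_is_significant(table):
    """True if the statistic exceeds the 0.05 critical value (up to 6 regions)."""
    statistic, df = chi_square_independence(table)
    return statistic > CHI2_CRITICAL_05[df]
```

A report generator could call `region_effect_is_significant` on the tallied counts and mention the result on the front page only when it returns True. With small expected counts per cell (a common rule of thumb is fewer than five), the chi-square approximation becomes unreliable, so sparse regions would need to be merged first.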

My thoughts so far:

Most of the research I've seen (and I'm no linguist--I've just poked
around on the web) tries to map dialects to regions, rather than mapping
regions to dialects.  The biggest difference is that exact boundaries
aren't important to me, but getting the population centers right is.  In
practice, it doesn't look like it makes much difference, since data
collection seems to be clustered by population anyway.

Like I said before, for this purpose I could probably get away with a
pretty crude region-to-dialect map.  With a sample size of 500, I might
rarely find statistically significant regional differences.  (And, of
course, dialect is only one possible reason for problems.)  I'm not
sure whether it would make a difference to use, for example,
PopvsSoda.com data rather than the University of Pennsylvania atlas.
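
To put rough numbers on the sample-size worry, here is a small sketch of the usual normal-approximation margin of error for a proportion. The even split of 500 callers into five regions and the 20% dissatisfaction rate are assumptions chosen only for illustration; they are not figures from any actual study.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Assumed for illustration: 500 callers split evenly across 5 regions,
# with roughly 20% of callers in a region reporting difficulty.
callers_per_region = 500 // 5
print(round(margin_of_error(0.20, callers_per_region), 4))  # 0.0784
```

Under those assumptions each region's dissatisfaction rate is only known to within about eight percentage points either way, so two regions would have to differ by a fairly wide margin before the difference stood out.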

Probably the easiest thing to do is to come up with a small number of
dialect clusters (e.g. four or five regions, using only state-level
granularity), see if any of our historical data shows anything of
interest, and go from there.
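
A minimal sketch of that first step might look like the following. The cluster names and the state assignments are placeholders invented for this example, not taken from PopvsSoda.com, the University of Pennsylvania atlas, or any other dialect survey, and the `(state, had_difficulty)` record layout is likewise hypothetical.

```python
from collections import defaultdict

# Placeholder state-to-cluster map. These groupings are NOT a linguistic
# claim; a real map would be filled in from a dialect survey.
STATE_TO_CLUSTER = {
    "MN": "north", "WI": "north", "MI": "north",
    "GA": "south", "AL": "south", "TX": "south",
    "NY": "northeast", "MA": "northeast",
    "CA": "west", "WA": "west",
}


def cluster_for(state):
    """Look up a caller's cluster by two-letter state code."""
    return STATE_TO_CLUSTER.get(state.upper(), "unknown")


def tally_by_cluster(callers):
    """Count callers per cluster as [had difficulty, no difficulty].

    `callers` is an iterable of (state, had_difficulty) pairs.
    """
    counts = defaultdict(lambda: [0, 0])
    for state, had_difficulty in callers:
        counts[cluster_for(state)][0 if had_difficulty else 1] += 1
    return dict(counts)
```

Running `tally_by_cluster` over historical study data would give one row of counts per cluster, which is the shape of table needed to look for regional differences. Whether to key on place of origin, current address, or some combination of the two is left open here.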

I'm curious whether any of you have comments or insights into this sort
of region-to-dialect prediction.

David Leppik