in re Google
Page Stephens
hpst at EARTHLINK.NET
Wed May 26 18:20:06 UTC 2004
This is slightly off subject, but I have been thinking of putting it up on this list because it brings up some important problems with the naive use of Google, which, I confess, as a lazy bastard, I use all the time.
I think that Sam brings up some interesting and worthwhile points which we all need to consider since we all "google" all too often.
It is reprinted with the permission of the author.
Page Stephens
The President's Corner
Issue #4
The Role of Information Science in the Post-Googlian Environment
Good evening. I am just delighted to be here and have this opportunity to
talk with you. It's great to be back at USF. Receiving my master's degree
from here is one of the most important events in my life. My love for
libraries and my interest in how technology can be used to provide better
and more efficient services guided my experiences here.
I want to thank Bahaa and Lily El-Hadidy because without them I would have
never found the American Society for Information Science and Technology.
We had so much fun celebrating the start of my presidential year at the
annual conference in Long Beach last October. Of course, learning how to
do online searches and understanding how databases are constructed are
even more valuable benefits of knowing Dr. El-Hadidy!
I also want to acknowledge Vicki Gregory and Tom Terrell. Dr. Gregory is
on the ASIS&T board of directors with me and we're working hard to get Dr.
Terrell on the board also. They organized this visit and honored me
greatly with a request to give the Alice G. Smith lecture tonight. Of
course, it is Dr. Smith's legacy that we really celebrate tonight. She
directed the school media program here for years and was instrumental in
obtaining ALA accreditation for the school. So thank you, Dr. Alice G.
Smith!
When I sent the title for this talk, Tom Terrell replied that he wasn't
sure that Google could be a verb let alone an adverb or adjective. The
funny part of all of this is that my thoughts and conception of a
post-googlian environment started about 6 months ago when a student told
me that he had googled me ... hmm, to be googled! So, pre-google, google,
post-google - well, you can see how my brain works! How many of you have
heard of Google? It is a giant search engine poised to take over the
universe of web-searching - a Googleverse of sorts. There is even a term
to describe the state of dominating the Internet search space, Googlopoly.
Seriously though, the ubiquity of this search engine and its popularity as
a single source of information frightens me. I have nothing against
Google and use it routinely for a variety of searches. My fear arises
from a speculation that there may be a time when a Google search returns
zero hits and the user assumes that the information requested does not
exist. This may already be the case. Maybe I should try "WMD" as a search
term and see what happens. The results may be a Google Hole, defined as
the state of having been led astray by Google results.
In light of this, I want to explore some of the information science tools
that may help prevent this post-googlian apocalypse I envision. To get us
started, I'll share some statistics and facts about Google and then move
to a few specific tools such as data mining, natural language processing,
mark-up languages, and dynamic databases that may offer some long-term
solutions. To close, I have some suggestions for what we can do to keep
our vast store of cultural heritage preserved and accessible - whether
Google finds it or not.
Facts about Google: In 2002, Google had fewer than 150 employees and
revenue of a little more than $400 million. In 2003, they had 1,400
employees and revenue of $900 million. That's 620 percent growth in one
year! By the way, much of this information is from doing a Google search
with the term "Google."
In 2003, they held about 35% of the market share, just a little more than
Yahoo!, another popular search engine. So far in 2004, they hold over 40%
of the search market share and Yahoo! has dropped to below 30%. They claim
3.3 billion Web pages crawled, which is less than 1% of the total Web.
AlltheWeb and Inktomi are close behind in the number of pages cataloged.
Today's Google is more about advertising than searching. They recently
added the location of a searcher to the relevance factors. The results
give you advertisements of local companies for whatever your query term
is. But like most things computational, 100% perfection doesn't always
occur. According to Josh McHugh in the March 2004 issue of Wired, "even
the best-laid algorithms can backfire. In December, two Verizon ads
appeared in the New York Times site just inches below a commentary
accusing Verizon of stealing from customers. Ouch." (page 121)
So we have this giant, almost-billion-dollar search company that is fast
replacing Lexis-Nexis as the research tool professionals use, crawling
only the most recent version of the Web. Pages that were changed or
deleted prior to the last crawl are lost. If you know about Brewster Kahle's
Internet Archive and its Wayback Machine, this may not be a problem.
However, I doubt that many users do a dual search for versions when they
are on a serious hunt for important information.
Okay, you get the picture of how easy it is for me to speculate that we
may indeed come to the point of believing that if Google doesn't find it,
it doesn't exist. So now, let's talk about information science tools and
what they may hold to improve functionality when searching the Web and our
digital collections.
First of all, let's acknowledge that the Web is NOT a library and in
reality we only have digital collections and very few digital libraries.
My good friend Clifford Lynch, Director of the Coalition for Networked
Information, points out that historically libraries have been relatively
passive. "They make material available but it's up to the patrons to
figure out what to do with it. Now there is a view that says that digital
libraries are not just places for calling up material, they're spaces for
collaboration and annotation and analysis and for authorship."
(www.acm.org/ubiquity/interviews/pf/c_lynch_1.html)
As we continue to build digital repositories and move toward the types of
functionality Dr. Lynch describes as a digital library, we need improved
methods for searching. If a digital library is truly an organic and
evolving space then searching only the most recently crawled pages will
not give us access to the riches of these collaborative spaces.
We have data mining technologies that are similar to older visualization
techniques where you take a huge amount of data and throw some robust
computational resources at the data. For visualization, you come up with
patterns to the data such as clustering and similarity judgments and
display these in some sexy, colorful way. In data mining, we look for
relationships among the data that may not be obvious from a single search
string. For example, the Centers for Disease Control may cross demographic
data with laboratory test results to begin to understand outbreaks of
influenza and other diseases. Marketing folk love the idea of finding
patterns in purchases, whether it is as simple as Amazon's recommender
service ("others that bought this book also bought x, y, and z") or as
complex as pink-colored products selling better in Tampa in the spring.
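To make the co-occurrence idea concrete, here is a minimal sketch in Python
of an "also bought" style recommender, using entirely made-up purchase data;
it illustrates the general technique only, not anything Amazon actually runs.

    from collections import Counter, defaultdict

    # Toy purchase histories (entirely made-up data for illustration).
    baskets = [
        {"Digital Libraries", "Metadata Basics", "XML in Practice"},
        {"Digital Libraries", "Metadata Basics"},
        {"Digital Libraries", "Preserving Digital Art"},
        {"Metadata Basics", "XML in Practice"},
    ]

    # Count how often each pair of items appears in the same basket.
    co_counts = defaultdict(Counter)
    for basket in baskets:
        for item in basket:
            for other in basket:
                if other != item:
                    co_counts[item][other] += 1

    def also_bought(item, n=3):
        """Items most often purchased together with `item`."""
        return [title for title, _ in co_counts[item].most_common(n)]

    print(also_bought("Digital Libraries"))
    # e.g. ['Metadata Basics', 'XML in Practice', 'Preserving Digital Art']

The same counting trick, given enough data and computing power, is what lets
a data miner surface relationships that no single search string would reveal.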
Natural Language Processing is another information science tool that has
been with us for decades, but it is only recently that we have been able to
identify or parse proper names and automatically decide whether they are
places, people, or organizations. This has been pretty shaky and not always
right but it sure is a good first cut and may provide us with a
cost-effective method of getting at primary levels of mark-up. We can use
our human intellect more efficiently in making the decisions the machine
computation is not able to make yet. My thought here is that if we can
train call center workers in India to speak with a Chicago accent, we can
do a whole lot more with ferreting out proper names in our digital
resources.
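As a toy illustration of what a "good first cut" at proper-name mark-up might
look like, here is a deliberately naive Python sketch that combines a
capitalization heuristic with tiny hand-made gazetteers. Real named-entity
systems learn from data, but the division of labor is the same: let the
machine make the cheap guesses and reserve human intellect for the hard cases.

    import re

    # Tiny hand-made gazetteers; a real system would learn these from data.
    PLACES = {"Tampa", "Dallas", "Chicago", "Long Beach"}
    ORGS = {"Google", "Yahoo!", "ASIS&T", "UNT"}

    def tag_proper_names(text):
        """Very rough first cut: find capitalized tokens and guess their type."""
        tags = []
        for token in re.findall(r"[A-Z][\w&!]*", text):
            if token in PLACES:
                tags.append((token, "PLACE"))
            elif token in ORGS:
                tags.append((token, "ORGANIZATION"))
            else:
                tags.append((token, "PERSON?"))   # unknown capitalized word
        return tags

    print(tag_proper_names("Clifford Lynch spoke about Google at UNT in Dallas."))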
Now, let's talk about mark-up languages. We all know metadata's momma, the
Dublin Core Categories, and her grandmother, MARC. They work; they point to
author/creator, title, subject headings, etc. What they don't do very well
is keep up with semantic structures that may change over time as we "do"
things to the digital object. For example, let's say that Dr. Lynch finds
this talk on the web and annotates his quote with some clarification or
even questions to me. How do we capture and mark-up this collaborative
effort? Not with today's structures. Maybe the closest are XML (extensible
mark-up language) schemas that support scholarly communications,
but these are expensive efforts and often I hear from the builders of
digital libraries that the cost of mark-up far exceeds the cost of
digitization.
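As a rough sketch of what such a record might look like, the following Python
snippet builds a small Dublin Core description and layers a later annotation
on top of it. The dc: elements are the real Dublin Core element set; the ann:
elements and namespace are invented purely for illustration.

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"   # the Dublin Core element set
    EX = "http://example.org/annotation/"     # hypothetical annotation namespace

    ET.register_namespace("dc", DC)
    ET.register_namespace("ann", EX)

    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}title").text = (
        "The Role of Information Science in the Post-Googlian Environment")
    ET.SubElement(record, f"{{{DC}}}creator").text = "Samantha Hastings"
    ET.SubElement(record, f"{{{DC}}}date").text = "2004"

    # A later annotation layered on top of the original description.
    ann = ET.SubElement(record, f"{{{EX}}}annotation")
    ET.SubElement(ann, f"{{{EX}}}annotator").text = "Clifford Lynch"
    ET.SubElement(ann, f"{{{EX}}}note").text = "Clarification of the quoted passage."

    print(ET.tostring(record, encoding="unicode"))

The hard part, of course, is not writing the elements but agreeing on them and
paying for the human judgment that fills them in.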
I recently spoke at the Visual Resources Conference about preservation of
born-digital art. One of the questions from the audience was "why in the
world do we want to preserve every iteration of a digital work?" I didn't
have an immediate answer other than to comment that of course, selection
is still something we do and it is a professional obligation. I've had
some time to think now and it occurs to me that just as the papyrus tag
tells us that a scroll or document existed in the past, we may need to be
sure that, if nothing else, we keep records of the wrappers or metadata for
all digital objects. Knowledge that something existed does not equal
access to the object itself but it sure beats not having a clue!
We may need to start thinking about mark-up as a series of events. It may
even be that mark-up evolves over time with the object. This is an area
that my friend and colleague Bill Moen is working on as he builds
interoperability for the Virtual Library of Texas.
If we have series or levels of description then how do we keep track of
the different versions? That's a good question, isn't it?
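One illustrative answer, sketched in Python with invented field names, is to
treat description as an append-only log of events, so every earlier version
stays recoverable rather than being overwritten:

    from datetime import datetime

    # Description as an append-only log of events, not a single record.
    markup_log = []

    def record_event(object_id, action, metadata):
        """Append one descriptive event; nothing is ever overwritten."""
        markup_log.append({
            "object": object_id,
            "action": action,          # e.g. "created", "annotated", "revised"
            "metadata": metadata,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def versions(object_id):
        """All descriptive events for an object, oldest first."""
        return [e for e in markup_log if e["object"] == object_id]

    record_event("talk-001", "created", {"title": "Post-Googlian Environment"})
    record_event("talk-001", "annotated", {"note": "Question about the Lynch quote"})
    print(len(versions("talk-001")))   # 2

Whatever the actual storage, the point is the same: mark-up that evolves with
the object needs a history, not just a current state.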
The last tool I want to mention tonight is the idea of an interactive
database that is searchable on the Web. Dynamic databases, as they are
sometimes called, really present a problem in the search engine world.
Right now, if you want deep content from, say, the New York Times archive
and the Public Library Statistics database, you have to search them
separately even though they are both available on the Web. Linking this
dynamic content is also a problem. You have .asp (Windows Active Server
Pages), .cfm (proprietary ColdFusion software), and .php from the
open-source software world.
At UNT, we have been working with a collection of black and white
photographs from the Sepia Magazine archive. We recently moved them from
MS Access to MySQL and coded the fields and tables so you can search the
database on the Web. It works, and I am absolutely enamored of open-source
software solutions.
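For a flavor of what that coding amounts to, here is a minimal Python sketch
of the kind of keyword query a web front end might run against such a
database. The table and column names are invented, and Python's built-in
sqlite3 stands in for MySQL so the example runs anywhere.

    import sqlite3

    # sqlite3 stands in for MySQL here; table and column names are invented.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE photographs (
                        id INTEGER PRIMARY KEY,
                        caption TEXT,
                        year INTEGER,
                        descriptors TEXT)""")
    conn.execute("INSERT INTO photographs (caption, year, descriptors) VALUES (?, ?, ?)",
                 ("Jazz club on a Saturday night", 1962, "music; nightlife"))

    def search(term):
        """Simple keyword search a web front end could call for each query."""
        cur = conn.execute(
            "SELECT caption, year FROM photographs "
            "WHERE caption LIKE ? OR descriptors LIKE ?",
            (f"%{term}%", f"%{term}%"))
        return cur.fetchall()

    print(search("jazz"))   # [('Jazz club on a Saturday night', 1962)]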
The Sepia collection is a slice of Americana from the 1950s through the
1980s. It is unique and an incredible source of information about Black
America during this time period, and thanks to the African American Museum
in Fair Park, Dallas, we have the opportunity to work with these images.
There are no thesauri or controlled vocabularies that work with the
variety of images in this collection. Added to this is the fact that for
approximately half of the collection, we have no indicators for subject,
nor any existing keywords or descriptors.
We decided to experiment with a subset of the images and see if we can
build a thesaurus on the fly with user-supplied descriptors. So far, so
good, but we still need to do time-intensive human interpretation and
intervention to ensure quality. Here, I need to make my shameless plug:
please go to www.sepiaproject.unt.edu and take the user survey. I promise
to let you know how it works out.
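A toy Python sketch of the basic mechanism, with made-up submissions: tally
the descriptors users supply for each image and pass only the terms that
reach some threshold on to a human reviewer.

    from collections import Counter, defaultdict

    # User-supplied descriptors per image (made-up data).
    submissions = [
        ("img-042", "church picnic"),
        ("img-042", "picnic"),
        ("img-042", "Church Picnic"),
        ("img-107", "marching band"),
    ]

    # Tally normalized descriptors for each image.
    tallies = defaultdict(Counter)
    for image_id, descriptor in submissions:
        tallies[image_id][descriptor.strip().lower()] += 1

    def candidate_terms(image_id, min_votes=2):
        """Descriptors suggested often enough to send to a human reviewer."""
        return [t for t, n in tallies[image_id].items() if n >= min_votes]

    print(candidate_terms("img-042"))   # ['church picnic']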
I think we will see more and more sites linked to dynamic databases that
allow user feedback in some form or another. How do we track and mark-up
these versions that result from this rich input? And more important, how
do we find this information without deep mark-up that reflects these
changes?
So these are a few of the information science tools I wanted to bring to
your attention. Before we get to specific suggestions, I should point out
that there are many other crucial issues we should be thinking about that
I have not addressed -- things in the realm of policies and politics, such
as copyright, privacy issues, authenticity and validation of sources and
the ever-sticky stuff around funding and maintenance and migration. I save
these important issues for another talk but bring them to your attention
as they color everything we do.
In closing, let's spend a few minutes on what you can do to help preserve
and provide access to our next generation of digital collections and
libraries as well as improve our search strategies. I have 7 areas for
your consideration.
1. Collaborate: talk to people from mathematics, engineering, computer
science, information science, libraries, funding agencies, and industry.
2. Talk about your concerns and publish what doesn't work so we can begin
to define difficult areas and build coordinated research agendas.
3. Throw huge hunks of computational power at some of this stuff for first
levels of analysis and use your more expensive human intellect for
clean-up and decision making processes.
4. Invest in metadata and deep mark-up that can be migrated and added to
as needed. Pay attention to standards and deploy them in your projects.
5. Make preservation decisions and develop appropriate policies. Think
about migration and longevity and how you will serve up your objects and
accompanying information about those objects in the future.
As an aside, I would be terribly remiss if I didn't take this opportunity
to push a simple preservation strategy I really believe works. LOCKSS,
"lot of copies keep stuff safe" is a wonderful concept and it works for
books as well as for digital content!
6. Let's keep training and recruiting the best and the brightest into our
programs of library and information sciences education. I owe a great
debt to IMLS (the federal Institute of Museum and Library Services,
www.imls.gov) for the support they gave me to develop our program in
Digital Image Management. The specialty requires a strange mix of
technologist and manager but our graduates are sought after by libraries,
museums and industry for their unusual skill sets.
7. In conclusion, let's bring back bibliographic instruction and call it
digital library instruction or how to find what you need with or without
Google. I think our role as educators cannot be underplayed or traded
for cheap tutorials. We need to help our patrons learn how to recognize
authenticity and know when they have exhausted a search.
Well, that's all I have for now. Let's open this up for comments and
questions from you.
Best regards,
Sam
Samantha Hastings
2004 ASIS&T President
hastings at lis.admin.unt.edu
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis at asis.org
Copyright 2004, American Society for Information Science and Technology