in re Google
Page Stephens
hpst at EARTHLINK.NET
Wed May 26 18:20:06 UTC 2004
This is slightly off subject, but I have been thinking of putting it up on this list because it brings up some important problems with the naive use of Google, which, I confess, as a lazy bastard, I use all the time.
I think that Sam brings up some interesting and worthwhile points which we all need to consider since we all "google" all too often.
It is reprinted with the permission of the author.
Page Stephens
The President's Corner
Issue #4
The Role of Information Science in the Post-Googlian Environment
Good evening. I am just delighted to be here and have this opportunity to
talk with you. It's great to be back at USF. Receiving my master's degree
from here is one of the most important events in my life. My love for
libraries and my interest in how technology can be used to provide better
and more efficient services guided my experiences here.
I want to thank Bahaa and Lily El-Hadidy because without them I would have
never found the American Society for Information Science and Technology.
We had so much fun celebrating the start of my presidential year at the
annual conference in Long Beach last October. Of course, learning how to
do online searches and understanding how databases are constructed are
even more valuable benefits of knowing Dr. El-Hadidy!
I also want to acknowledge Vicki Gregory and Tom Terrell. Dr. Gregory is
on the ASIS&T board of directors with me and we're working hard to get Dr.
Terrell on the board also. They organized this visit and honored me
greatly with a request to give the Alice G. Smith lecture tonight. Of
course, it is Dr. Smith's legacy that we really celebrate tonight. She
directed the school media program here for years and was instrumental in
obtaining ALA accreditation for the school. So thank you, Dr. Alice G.
Smith!
When I sent the title for this talk, Tom Terrell replied that he wasn't
sure that Google could be a verb let alone an adverb or adjective. The
funny part of all of this is that my thoughts and conception of a
post-googlian environment started about 6 months ago when a student told
me that he had googled me ... hmm, to be googled! So, pre-google, google,
post-google - well, you can see how my brain works! How many of you have
heard of Google? It is a giant search engine poised to take over the
universe of web-searching - a Googleverse of sorts. There is even a term
to describe the state of dominating the Internet search space, Googlopoly.
Seriously though, the ubiquity of this search engine and its popularity as
a single source of information frightens me. I have nothing against
Google and use it routinely for a variety of searches. My fear arises
from a speculation that there may be a time when a Google search returns
zero hits and the user assumes that the information requested does not
exist. This may already be the case. Maybe I should try "WMD" as a search
term and see what happens. The results may be a Google Hole, defined as
the state of having been led astray by Google results.
In light of this, I want to explore some of the information science tools
that may help prevent this post-googlian apocalypse I envision. To get us
started, I'll share some statistics and facts about Google and then move
to a few specific tools such as data mining, natural language processing,
mark-up languages, and dynamic databases that may offer some long-term
solutions. To close, I have some suggestions for what we can do to keep
our vast store of cultural heritage preserved and accessible - whether
Google finds it or not.
Facts about Google: In 2002, Google had fewer than 150 employees and
revenue of a little more than $400 million. In 2003, they had 1,400
employees and revenue of $900 million. That's 620 percent growth in one
year! By the way, much of this information is from doing a Google search
with the term "Google."
In 2003, they held about 35% of the market share, just a little more than
Yahoo!, another popular search engine. So far in 2004, they hold over 40%
of the search market share and Yahoo! has dropped to below 30%. They claim
3.3 billion Web pages crawled, which is less than 1% of the total Web.
AlltheWeb and Inktomi are close behind in the number of pages cataloged.
Today's Google is more about advertising than searching. They recently
added the location of a searcher to the relevance factors. The results
give you advertisements of local companies for whatever your query term
is. But like most things computational, 100% perfection doesn't always
occur. According to Josh McHugh in the March 2004 issue of Wired, "even
the best-laid algorithms can backfire. In December, two Verizon ads
appeared in the New York Times site just inches below a commentary
accusing Verizon of stealing from customers. Ouch." (page 121)
So we have this giant, almost-billion-dollar search company that is fast
replacing Lexis-Nexis as the research tool professionals use, crawling
only the most recent version of the Web. Pages that were changed or
deleted prior to the last crawl are lost. If you know about Brewster Kahle's
Internet Archive and its Wayback Machine, this may not be a problem.
However, I doubt that many users do a dual search for versions when they
are on a serious hunt for important information.
Okay, you get the picture of how easy it is for me to speculate that we
may indeed come to the point of believing that if Google doesn't find it,
it doesn't exist. So now, let's talk about information science tools and
what they may hold to improve functionality when searching the Web and our
digital collections.
First of all, let's acknowledge that the Web is NOT a library and in
reality we only have digital collections and very few digital libraries.
My good friend Clifford Lynch, Director of the Coalition for Networked
Information, points out that historically libraries have been relatively
passive. "They make material available but it's up to the patrons to
figure out what to do with it. Now there is a view that says that digital
libraries are not just places for calling up material, they're spaces for
collaboration and annotation and analysis and for authorship."
(www.acm.org/ubiquity/interviews/pf/c_lynch_1.html)
As we continue to build digital repositories and move toward the types of
functionality Dr. Lynch describes as a digital library, we need improved
methods for searching. If a digital library is truly an organic and
evolving space then searching only the most recently crawled pages will
not give us access to the riches of these collaborative spaces.
We have data mining technologies that are similar to older visualization
techniques where you take a huge amount of data and throw some robust
computational resources at the data. For visualization, you come up with
patterns to the data such as clustering and similarity judgments and
display these in some sexy, colorful way. In data mining, we look for
relationships among the data that may not be obvious from a single search
string. For example, the Centers for Disease Control may cross demographic
data with laboratory test results to begin to understand outbreaks of
influenza and other diseases. Marketing folk love the idea of finding
patterns in purchases, whether it is as simple as Amazon's recommender
service ("others that bought this book also bought x, y, and z") or as
complex as pink-colored products selling better in Tampa in the spring.
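To make the co-occurrence idea concrete, here is a minimal sketch in Python
of an "also bought" style recommender, using entirely made-up purchase data;
it illustrates the general technique only, not anything Amazon actually runs.

    from collections import Counter, defaultdict

    # Toy purchase histories (entirely made-up data for illustration).
    baskets = [
        {"Digital Libraries", "Metadata Basics", "XML in Practice"},
        {"Digital Libraries", "Metadata Basics"},
        {"Digital Libraries", "Preserving Digital Art"},
        {"Metadata Basics", "XML in Practice"},
    ]

    # Count how often each pair of items appears in the same basket.
    co_counts = defaultdict(Counter)
    for basket in baskets:
        for item in basket:
            for other in basket:
                if other != item:
                    co_counts[item][other] += 1

    def also_bought(item, n=3):
        """Items most often purchased together with `item`."""
        return [title for title, _ in co_counts[item].most_common(n)]

    print(also_bought("Digital Libraries"))
    # e.g. ['Metadata Basics', 'XML in Practice', 'Preserving Digital Art']

The same counting trick, given enough data and computing power, is what lets
a data miner surface relationships that no single search string would reveal.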
Natural Language Processing is another information science tool that has
been with us for decades, but it is only recently that we have been able to
identify or parse proper names and automatically decide whether they are
places, people, or organizations. This has been pretty shaky and not always
right but it sure is a good first cut and may provide us with a
cost-effective method of getting at primary levels of mark-up. We can use
our human intellect more efficiently in making the decisions the machine
computation is not able to make yet. My thought here is that if we can
train call center workers in India to speak with a Chicago accent, we can
do a whole lot more with ferreting out proper names in our digital
resources.
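As a toy illustration of what a "good first cut" at proper-name mark-up might
look like, here is a deliberately naive Python sketch that combines a
capitalization heuristic with tiny hand-made gazetteers. Real named-entity
systems learn from data, but the division of labor is the same: let the
machine make the cheap guesses and reserve human intellect for the hard cases.

    import re

    # Tiny hand-made gazetteers; a real system would learn these from data.
    PLACES = {"Tampa", "Dallas", "Chicago", "Long Beach"}
    ORGS = {"Google", "Yahoo!", "ASIS&T", "UNT"}

    def tag_proper_names(text):
        """Very rough first cut: find capitalized tokens and guess their type."""
        tags = []
        for token in re.findall(r"[A-Z][\w&!]*", text):
            if token in PLACES:
                tags.append((token, "PLACE"))
            elif token in ORGS:
                tags.append((token, "ORGANIZATION"))
            else:
                tags.append((token, "PERSON?"))   # unknown capitalized word
        return tags

    print(tag_proper_names("Clifford Lynch spoke about Google at UNT in Dallas."))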
Now, let's talk about mark-up languages. We all know metadata's momma, the
Dublin Core Categories, and her grandmother, MARC. They work; they point to
author/creator, title, subject headings, etc. What they don't do very well
is keep up with semantic structures that may change over time as we "do"
things to the digital object. For example, let's say that Dr. Lynch finds
this talk on the web and annotates his quote with some clarification or
even questions to me. How do we capture and mark-up this collaborative
effort? Not with today's structures. Maybe the closest are XML (extensible
mark-up language) schemas that support scholarly communications,
but these are expensive efforts and often I hear from the builders of
digital libraries that the cost of mark-up far exceeds the cost of
digitization.
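As a rough sketch of what such a record might look like, the following Python
snippet builds a small Dublin Core description and layers a later annotation
on top of it. The dc: elements are the real Dublin Core element set; the ann:
elements and namespace are invented purely for illustration.

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"   # the Dublin Core element set
    EX = "http://example.org/annotation/"     # hypothetical annotation namespace

    ET.register_namespace("dc", DC)
    ET.register_namespace("ann", EX)

    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}title").text = (
        "The Role of Information Science in the Post-Googlian Environment")
    ET.SubElement(record, f"{{{DC}}}creator").text = "Samantha Hastings"
    ET.SubElement(record, f"{{{DC}}}date").text = "2004"

    # A later annotation layered on top of the original description.
    ann = ET.SubElement(record, f"{{{EX}}}annotation")
    ET.SubElement(ann, f"{{{EX}}}annotator").text = "Clifford Lynch"
    ET.SubElement(ann, f"{{{EX}}}note").text = "Clarification of the quoted passage."

    print(ET.tostring(record, encoding="unicode"))

The hard part, of course, is not writing the elements but agreeing on them and
paying for the human judgment that fills them in.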
I recently spoke at the Visual Resources Conference about preservation of
born-digital art. One of the questions from the audience was "why in the
world do we want to preserve every iteration of a digital work?" I didn't
have an immediate answer other than to comment that of course, selection
is still something we do and it is a professional obligation. I've had
some time to think now and it occurs to me that just as the papyrus tag
tells us that a scroll or document existed in the past, we may need to be
sure that, if nothing else, we keep records of the wrappers or metadata for
all digital objects. Knowledge that something existed does not equal
access to the object itself but it sure beats not having a clue!
We may need to start thinking about mark-up as a series of events. It may
even be that mark-up evolves over time with the object. This is an area
that my friend and colleague Bill Moen is working on as he builds
interoperability for the Virtual Library of Texas.
If we have series or levels of description then how do we keep track of
the different versions? That's a good question, isn't it?
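One illustrative answer, sketched in Python with invented field names, is to
treat description as an append-only log of events, so every earlier version
stays recoverable rather than being overwritten:

    from datetime import datetime

    # Description as an append-only log of events, not a single record.
    markup_log = []

    def record_event(object_id, action, metadata):
        """Append one descriptive event; nothing is ever overwritten."""
        markup_log.append({
            "object": object_id,
            "action": action,          # e.g. "created", "annotated", "revised"
            "metadata": metadata,
            "timestamp": datetime.utcnow().isoformat(),
        })

    def versions(object_id):
        """All descriptive events for an object, oldest first."""
        return [e for e in markup_log if e["object"] == object_id]

    record_event("talk-001", "created", {"title": "Post-Googlian Environment"})
    record_event("talk-001", "annotated", {"note": "Question about the Lynch quote"})
    print(len(versions("talk-001")))   # 2

Whatever the actual storage, the point is the same: mark-up that evolves with
the object needs a history, not just a current state.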
The last tool I want to mention tonight is the idea of an interactive
database that is searchable on the Web. Dynamic databases, as they are
sometimes called, really present a problem in the search engine world.
Right now, if you want deep content from, say, the New York Times archive
and the Public Library Statistics database, you have to search them
separately even though they are both available on the Web. Linking this
dynamic content is also a problem. You have .asp (Windows Active Server
Pages), .cfm (proprietary ColdFusion software), and .php from the
open-source software world.
At UNT, we have been working with a collection of black and white
photographs from the Sepia Magazine archive. We recently moved them from
MS Access to MySQL and coded the fields and tables so you can search the
database on the Web. It works, and I am absolutely enamored of open-source
software solutions.
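For a flavor of what that coding amounts to, here is a minimal Python sketch
of the kind of keyword query a web front end might run against such a
database. The table and column names are invented, and Python's built-in
sqlite3 stands in for MySQL so the example runs anywhere.

    import sqlite3

    # sqlite3 stands in for MySQL here; table and column names are invented.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE photographs (
                        id INTEGER PRIMARY KEY,
                        caption TEXT,
                        year INTEGER,
                        descriptors TEXT)""")
    conn.execute("INSERT INTO photographs (caption, year, descriptors) VALUES (?, ?, ?)",
                 ("Jazz club on a Saturday night", 1962, "music; nightlife"))

    def search(term):
        """Simple keyword search a web front end could call for each query."""
        cur = conn.execute(
            "SELECT caption, year FROM photographs "
            "WHERE caption LIKE ? OR descriptors LIKE ?",
            (f"%{term}%", f"%{term}%"))
        return cur.fetchall()

    print(search("jazz"))   # [('Jazz club on a Saturday night', 1962)]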
The Sepia collection is a slice of Americana from the 1950s through the
1980s. It is unique and an incredible source of information about Black
America during this time period, and thanks to the African American Museum
in Fair Park, Dallas, we have the opportunity to work with these images.
There are no thesauri or controlled vocabularies that work with the
variety of images in this collection. Added to this is the fact that for
approximately half of the collection, we have no indicators for subject,
nor any existing keywords or descriptors.
We decided to experiment with a subset of the images and see if we can
build a thesaurus on the fly with user-supplied descriptors. So far, so
good, but we still need to do time-intensive human interpretation and
intervention to ensure quality. Here, I need to make my shameless plug:
please go to www.sepiaproject.unt.edu and take the user survey. I promise
to let you know how it works out.
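A toy Python sketch of the basic mechanism, with made-up submissions: tally
the descriptors users supply for each image and pass only the terms that
reach some threshold on to a human reviewer.

    from collections import Counter, defaultdict

    # User-supplied descriptors per image (made-up data).
    submissions = [
        ("img-042", "church picnic"),
        ("img-042", "picnic"),
        ("img-042", "Church Picnic"),
        ("img-107", "marching band"),
    ]

    # Tally normalized descriptors for each image.
    tallies = defaultdict(Counter)
    for image_id, descriptor in submissions:
        tallies[image_id][descriptor.strip().lower()] += 1

    def candidate_terms(image_id, min_votes=2):
        """Descriptors suggested often enough to send to a human reviewer."""
        return [t for t, n in tallies[image_id].items() if n >= min_votes]

    print(candidate_terms("img-042"))   # ['church picnic']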
I think we will see more and more sites linked to dynamic databases that
allow user feedback in some form or another. How do we track and mark-up
these versions that result from this rich input? And more important, how
do we find this information without deep mark-up that reflects these
changes?
So these are a few of the information science tools I wanted to bring to
your attention. Before we get to specific suggestions, I should point out
that there are many other crucial issues we should be thinking about that
I have not addressed -- things in the realm of policies and politics, such
as copyright, privacy issues, authenticity and validation of sources and
the ever-sticky stuff around funding and maintenance and migration. I save
these important issues for another talk but bring them to your attention
as they color everything we do.
In closing, let's spend a few minutes on what you can do to help preserve
and provide access to our next generation of digital collections and
libraries as well as improve our search strategies. I have 7 areas for
your consideration.
1. Collaborate: talk to people from mathematics, engineering, computer
science, information science, libraries, funding agencies, and industry.
2. Talk about your concerns and publish what doesn't work so we can begin
to define difficult areas and build coordinated research agendas.
3. Throw huge hunks of computational power at some of this stuff for first
levels of analysis and use your more expensive human intellect for
clean-up and decision making processes.
4. Invest in metadata and deep mark-up that can be migrated and added to
as needed. Pay attention to standards and deploy them in your projects.
5. Make preservation decisions and develop appropriate policies. Think
about migration and longevity and how you will serve up your objects and
accompanying information about those objects in the future.
As an aside, I would be terribly remiss if I didn't take this opportunity
to push a simple preservation strategy I really believe works. LOCKSS,
"lot of copies keep stuff safe" is a wonderful concept and it works for
books as well as for digital content!
6. Let's keep training and recruiting the best and the brightest into our
programs of library and information sciences education. I owe a great
debt to IMLS (the federal Institute of Museum and Library Services,
www.imls.gov) for the support they gave me to develop our program in
Digital Image Management. The specialty requires a strange mix of
technologist and manager but our graduates are sought after by libraries,
museums and industry for their unusual skill sets.
7. In conclusion, let's bring back bibliographic instruction and call it
digital library instruction or how to find what you need with or without
Google. I think our role as educators cannot be underplayed or traded
for cheap tutorials. We need to help our patrons learn how to recognize
authenticity and know when they have exhausted a search.
Well, that's all I have for now. Let's open this up for comments and
questions from you.
Best regards,
Sam
Samantha Hastings
2004 ASIS&T President
hastings at lis.admin.unt.edu
American Society for Information Science and Technology
1320 Fenwick Lane, Suite 510, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis at asis.org
Copyright 2004, American Society for Information Science and Technology