11 The Deep Web

We learn what the Deep Web is and why we need to care about it.

Class held on 10/08/2008. Student notes are available on this page. Possible questions are available on this page.

Class structure

  1. Go through “At beginning of class” information
  2. I'll lecture for a bit (using some slides).
  3. Work on exercises.

At beginning of class

  1. My wife's biopsy went well. We'll find out the results by no later than Monday (or so we're told).
    • Thank you for your kind words of support.
  2. Check your wiki grades. If you got a 10, please be sure to add the blog to the class wiki.
  3. Be sure to have all of your information entered under “Basic/Grades”. Otherwise, I can't record your grades!
    • Don't forget — this is how you “turn in” your assignments.
  4. Blog entries
    • to/too
    • choose/chose (lose/loose)
    • definitely/definately/defiantly
  5. Check who is doing class notes for today.
  6. I'm working on grading and recording. I'm about halfway through grading your analysis assignments.
  7. This is my write-up of the analysis assignment.
  8. Industry updates

My notes

  1. General Web search
    • Suggests that all information can be searched within one system
    • Easy and self-explanatory
    • Has only a limited understanding of "structure"
  2. The Invisible Web
    • "Invisible" to the general search engines since they don't index it
    • You'll hear about the "Invisible Web" or the "Deep Web" — same thing
    • Pages that are invisible
      • Disconnected page
      • Page consisting primarily of images, audio, video
      • Flash, Shockwave, compressed files
      • Content retrieved as a result of filling out forms
      • Real time information (ex: stock quotes)
      • Pages that are proprietary
    • Significance of the Invisible Web
      • Bergman's widely cited statistic is that there are 550 billion documents in the Invisible Web
        • Others believe it's more like 20-100 billion
      • An estimated 300K Web sites have queryable databases
  3. Academic Web-based search
    • More academic content is moving to the Web exclusively
    • Part of general trend from print to electronic
    • Much of this is contained in the Invisible Web
  4. Explain how search engines work
    • General
      • Crawlers go out and send information back to the central database
      • Queries go against the central database
      • A search engine company's expertise lies in the design of the index and of the query process (including the input interface and the output formatting and reporting)
    • Academic
      • Crawlers go out, find a database, and what? Index the query interface page? Send some standard queries to the index and save the results?
  5. Should you consider using Google Scholar?
    • Pros
      • A cross-database (federated) search engine
      • Returns snippets from articles (and sometimes abstracts)
      • Indexes the full text (actually, part of the full text) and not just the abstracts and subject terms
      • Can link to your own school's library
    • Cons
      • Secretive about its coverage of specific publishers, journals
      • Limits its searches to the first 100-120K of a page
      • Hasn't been updated much (at all?) since its launch
      • Returns far fewer documents than the native search engines
      • Searching by field is fairly unreliable and counter-productive
  6. What do we want from an academic search engine?
    • Comprehensive
      • Contains lots of journals over lots of topics
      • Goes far back in time
      • Up-to-date
    • Integrated across databases
    • Integrated into a database
    • Transparent as to what it contains or doesn't contain
  7. Recommendation
    • Use Google Scholar
      • as a way to find free, online versions of articles you already know you want
      • like you use Wikipedia — as a good starting place for exploring
    • Use the other Deep Web search tools — Scirus, Turbo10, plus the LII.
    • To do a complete search, you should definitely talk to a librarian and use the Library's immense set of resources.
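
The crawl, index, and query steps from "Explain how search engines work" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine is built: the toy Web (`TOY_WEB`), its page names, and the functions `crawl`, `build_index`, and `search` are all made up for this example. It also shows why a disconnected page and a page that exists only as the result of a form submission stay invisible to a crawler that follows links.

```python
from collections import deque

# Hypothetical toy Web: page name -> (page text, outgoing links).
TOY_WEB = {
    "home": ("welcome to the carbon trading portal", ["news", "search_form"]),
    "news": ("carbon market news and prices", ["home"]),
    "search_form": ("enter a query to search our database", []),
    # Exists only as the answer to a form submission; nothing links to it.
    "results?q=permits": ("database records on emission permits", []),
    # Disconnected page: no other page links to it.
    "orphan": ("unlinked report on carbon offsets", []),
}


def crawl(start):
    """Breadth-first crawl that only follows links; returns the pages reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        _, links = TOY_WEB[page]
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def build_index(pages):
    """The 'central database': maps each word to the set of pages containing it."""
    index = {}
    for page in pages:
        text, _ = TOY_WEB[page]
        for word in text.split():
            index.setdefault(word, set()).add(page)
    return index


def search(index, query):
    """Return the pages that contain every word of the query (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


visible = crawl("home")               # what the crawler can reach
index = build_index(visible)          # queries go against this, not the live Web
invisible = set(TOY_WEB) - visible    # the toy "Invisible Web"

print(sorted(visible))
print(sorted(invisible))
print(sorted(search(index, "carbon")))
print(sorted(search(index, "permits")))
```

In this sketch the query for "permits" finds nothing, even though a page about permits exists, because the crawler never filled out the form that produces it.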

In-class exercises

  1. Google Scholar
    1. Basics
      1. intitle:"carbon trading" — 271 citations
        • Cited by
        • Referenced by (under “Related articles”)
        • Web search
        • Availability at UM library
    2. Weird logic
      1. the — 2.03 billion records
      2. a — 13.1 million records
      3. a OR the — 13.6 million records
    3. Subject groups
      1. intitle:Vietnamese — 9,690 records
      2. allintitle:Vietnam — 816,000 records (all subject areas)
        • Shockingly, as you will soon see, intitle:Vietnam returns the exact same results (which it should).
      3. allintitle:Vietnam — 29,100 records (with all of the subject areas checked)
      4. allintitle: Vietnam OR Vietnamese — 104,000 records (notice that this is less than the 816,000 found for Vietnam alone above)
      5. allintitle: Vietnam OR Vietnamese — 141,000 records (with all of the subject areas checked)
      6. Publication year strangeness
        1. allintitle: Vietnam OR Vietnamese 1435-2008 — 151,000 records
        2. allintitle: Vietnam OR Vietnamese 1960-2008 — 152,000 records
        3. allintitle: Vietnam OR Vietnamese 2010-2050 — 12 records
  2. Scirus
    1. title:carbon AND title:trading (market) — 560 hits
      • Ability to filter on the left (sources, file types)
      • Recommendations for refining your search on the left
      • Save or email the results.
      • Sort by relevance or date.
      • Similar results
  3. Turbo10
    1. Search for [carbon trading] at Turbo10.
      • Topic clusters
      • Engines
  4. BNet
    1. carbon trading
      • Content types on the right
      • RSS feeds
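
The "Weird logic" and "Publication year strangeness" results in the Google Scholar exercise can be checked against what exact set logic would require. The sketch below is a sanity check only, assuming the reported counts were exact; the function names are made up for this example, and the numbers are the ones recorded in the exercises. If A and B are the result sets for two terms, then A OR B must contain at least as many records as the larger of the two and at most as many as their sum, and a wider date range should never return fewer records than a narrower one.

```python
def or_bounds(count_a, count_b):
    """Lower and upper bounds on the size of (A OR B) under exact set logic."""
    return max(count_a, count_b), count_a + count_b


def or_is_consistent(count_a, count_b, count_or):
    """True if the reported OR count falls within the bounds."""
    low, high = or_bounds(count_a, count_b)
    return low <= count_or <= high


def range_is_monotone(narrow_count, wide_count):
    """True if the wider date range returns at least as many records."""
    return wide_count >= narrow_count


# "the" = 2.03 billion, "a" = 13.1 million, "a OR the" = 13.6 million
print(or_is_consistent(13_100_000, 2_030_000_000, 13_600_000))

# Vietnam = 816,000, Vietnamese = 9,690, Vietnam OR Vietnamese = 104,000
print(or_is_consistent(816_000, 9_690, 104_000))

# 1960-2008 = 152,000 records, but the wider 1435-2008 = 151,000 records
print(range_is_monotone(152_000, 151_000))
```

All three checks fail for the counts recorded in class. One plausible explanation (an assumption, not something the exercises establish) is that the reported totals are rough estimates rather than exact counts.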

Possible blog entries

There are two possible blog entries related to this class — you can write one, both or neither of these. But I would find these interesting.

  1. Write a blog entry on what you observed, what you learned and found interesting, focusing on information that other students might find useful.
  2. Go talk to a Ross librarian. Tell them your topic and ask which 3 to 5 databases or tools you might find most useful given that topic. See which databases they tell you to focus on. Use them for a while. By the end of the semester, write a blog entry describing how the information you find in these databases differs from what you would find on the Web at large or what you found in the Deep Web search tools we were introduced to above.

BTW, I would find it rather remarkable if you didn't have in your term project a section or group of resources or something related to information a person could get in a library's database (compared with Deep Web and the Web itself).

Resources

Search tools

  1. Scirus
  2. Google Scholar
  3. CompletePlanet — 70K databases (but appears to be dead!)
  4. Amazon Advanced Book Search — Yes, I am including Amazon, the book seller, on this list.
  5. University of Michigan Library
  6. INFOMINE (University of California, Riverside)
  7. Librarian's Internet Index
    • Overview: describes who they are, what they do, and what you might expect to get from looking at their site.
  8. Directory of Open Access Journals — 1262 journals are searchable at the article level, 211,294 articles.
  9. Turbo10
  10. Microsoft Live Search Academic (closed down in May 2008).

Articles

  1. Accessing the Deep Web
  2. Exploring the academic invisible Web
  3. Google Scholar revisited, by Péter Jacsó, Online Information Review, 32:1, 2008, pp. 102-114.
  4. The Deep Web: Surfacing hidden value
    • As summarized by the editor of The Journal of Electronic Publishing: "Michael K. Bergman, whose BrightPlanet company offers a new approach to search engines, examines the wealth of information that is available only on dynamically created Web sites, those that don't exist except as relational databases until someone seeks information from them. As more sites adopt the dynamic approach to pages, they are creating a challenge for standard search engines. This article looks at some alternatives."
  5. Search engine technology and digital libraries: Libraries need to discover the academic internet
  6. Google Scholar -- a new data source for citation analysis, by Anne-Wil Harzing, February 5, 2008 (7th version).

E-books

  1. Google Book Search
  2. Project Gutenberg
  3. American Memory (by the U.S. Library of Congress)
  4. Million Book Project
  5. Google Electronic Text Archives

Other
