12 Research Sites

We learn about several kinds of academic research sites. We also learn what the Deep Web is, why we need to care about it, and how we might go about accessing it.

Class held on 10/21/2009. (student notes; possible questions).

Class structure

  1. Go through “At beginning of class” information
  2. I'll lecture for a bit (no slides today).
  3. Work on exercises.

At beginning of class

On your own

  1. Read the current to-do list on the course home page.
  2. No grades (too much other prep going on)

What I'll cover

  1. Project stuff
    • RSS feeds vs current events stuff
  2. Grading stuff
    • Dean
    • Priority: status report feedback

My notes

  1. General Web search
    • Suggests that all information can be searched within one system
    • Easy and self-explanatory
    • Has only a limited understanding of "structure"
  2. The Invisible Web
    • "Invisible" to the general search engines since they don't index it
    • You'll hear about the "Invisible Web" or the "Deep Web" — same thing
    • Pages that are invisible
      • Disconnected page
      • Page consisting primarily of images, audio, video
      • Flash, Shockwave, compressed files
      • Content retrieved as a result of filling out forms
      • Real time information (ex: stock quotes)
      • Pages that are proprietary
    • Significance of the Invisible Web
      • Bergman's widely-cited statistic is that there are 550 billion documents in the invisible Web
        • Others believe it's more like 20-100 billion
      • An estimated 300K Web sites have queryable databases
      • 60 of the largest Deep Web sites contain about 750 terabytes of data
  3. Academic Web-based search
    • More academic content is moving to the Web exclusively
    • Part of general trend from print to electronic
    • Much of this is contained in the Invisible Web
  4. Explain how search engines work
    • General
      • Crawlers go out and send information back to the central database
      • Queries go against the central database
      • SE company expertise is in design of the index and design of the query process (including input interface and output formatting and reporting)
    • Academic
      • Crawlers go out, find a database, and what? Index the query interface page? Send some standard queries to the index and save the results?
  5. Should you consider using Google Scholar?
    • Pros
      • A cross-database (federated) search engine
      • Returns snippets from articles (and sometimes abstracts)
      • Indexes the full text (actually, part of the full text) and not just the abstracts and subject terms
      • Can link to your own school's library
    • Cons
      • Secretive about its coverage of specific publishers, journals
      • Limits its searches to the first 100-120K of a page
      • Hasn't been updated much (at all?) since its launch
      • Returns far fewer documents than the native search engines
      • Searching by field is fairly unreliable and counter-productive
  6. What do we want from an academic search engine?
    • Comprehensive
      • Contains lots of journals over lots of topics
      • Goes far back in time
      • Up-to-date
    • Integrated across databases
    • Integrated into a database
    • Transparent as to what it contains or doesn't contain
  7. Recommendation
    • Use Google Scholar
      • as a way to find free, online versions of articles you already know you want
      • like you use Wikipedia — as a good starting place for exploring
    • Use the other Deep Web search tools — Scirus, Turbo10, plus the LII.
    • To do a complete search, you should definitely talk to a librarian and use the Library's immense set of resources.
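The crawl, index, and query steps in note 4, and the "disconnected page" case from note 2, can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the four-page "web", its URLs, and the function names are all made up for the example, and real search engines are far more elaborate.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing links). All names are made up.
TOY_WEB = {
    "home": ("carbon trading overview", ["markets", "policy"]),
    "markets": ("carbon markets and prices", ["home"]),
    "policy": ("emissions policy and trading rules", []),
    # No page links here, so a link-following crawler never reaches it.
    "orphan": ("carbon trading database record", []),
}


def crawl(web, seed):
    """Follow links breadth-first from the seed and return the reachable URLs."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        _text, links = web[url]
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def build_index(web, urls):
    """Build the central index: word -> set of URLs containing that word."""
    index = {}
    for url in urls:
        text, _links = web[url]
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query (AND logic)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


crawled = crawl(TOY_WEB, "home")
index = build_index(TOY_WEB, crawled)
print(sorted(search(index, "carbon trading")))
```

The query runs against the index, not against the live pages. The "orphan" page contains both query words but never shows up in the results, because the crawler had no link to follow to it. That is the Invisible Web problem in miniature.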

In-class demonstration and discussion

  1. Google Scholar (the gorilla in the room)
    1. Basics
      1. intitle:"carbon trading" — 472 (271 citations in 2008)
        • Cited by
        • Referenced by (under “Related articles”)
        • Web search
        • Availability at UM library (set up under "Scholar Preferences")
        • "Recent articles" vs. "All articles"
    2. Weird logic — that appears to have been fixed in 2009!
      1. the — 10.6 million records (2.03 billion in 2008)
      2. a — 10.8 million records (13.1 million in 2008)
      3. a OR the — 11.2 million records (13.6 million in 2008)
    3. Subject groups
      1. intitle:Vietnamese — 11,000 records (9,690 in 2008)
      2. allintitle:Vietnam — 98,900 records (816,000 records in 2008) (all subject areas)
      3. allintitle:Vietnam — 23,600 records (29,100 records in 2008) (with all of the subject areas checked)
      4. allintitle: Vietnam OR Vietnamese — 109,000 records (104,000 in 2008; notice that this is less than the 816,000 found for Vietnam alone above)
      5. allintitle: Vietnam OR Vietnamese — 30,000 records (141,000 in 2008) (with all of the subject areas checked)
      6. Publication year strangeness
        1. intitle:Vietnam 1435-2008 — 20,200 records
        2. intitle:Vietnam 1960-2008 — 20,900 records
        3. intitle:Vietnam 2010-2050 — 2 records
  2. Scirus (deep web search competitor)
    1. title:"low carb" "low fat" "weight loss" — 560 hits
      • Ability to filter on the left (sources, file types)
      • Recommendations of refining your search on the left
      • Save or email the results.
      • Sort by relevance or date.
      • Similar results
  3. Google Books (book-based)
  4. UM Library (library-based)
  5. Biznar (specialized deep web search)
  6. BNet (another specialized search tool)
    1. carbon trading
      • Content types to right
      • RSS feeds
  7. Wolfram|Alpha (computational knowledge)
  8. Yahoo Directory (Web site directory)
    • Explore Business sites
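The "weird logic" and "publication year strangeness" in the Google Scholar demonstration come down to two sanity checks on hit counts. Here is a minimal Python sketch of those checks, using the counts recorded in the notes above. The function names are made up for the example. The checks assume the counts are exact and that the single-term and OR queries are comparable; search engines often report estimates, so a failed check is a warning sign and not proof of a bug.

```python
def or_count_is_plausible(count_a, count_b, count_a_or_b):
    """A OR B should match at least as many records as either term alone,
    and no more than the two counts added together."""
    return max(count_a, count_b) <= count_a_or_b <= count_a + count_b


def range_counts_are_plausible(count_wide, count_narrow):
    """A date range that contains another should return at least as many records."""
    return count_wide >= count_narrow


# (label, result of the check), using counts from the demonstration notes
checks = [
    ("the / a / a OR the (2008)",
     or_count_is_plausible(2_030_000_000, 13_100_000, 13_600_000)),
    ("the / a / a OR the (2009)",
     or_count_is_plausible(10_600_000, 10_800_000, 11_200_000)),
    ("Vietnam / Vietnamese / OR (2008)",
     or_count_is_plausible(816_000, 9_690, 104_000)),
    ("Vietnam / Vietnamese / OR (2009)",
     or_count_is_plausible(98_900, 11_000, 109_000)),
    ("Vietnam 1435-2008 vs 1960-2008",
     range_counts_are_plausible(20_200, 20_900)),
]

for label, ok in checks:
    print(f"{label}: {'plausible' if ok else 'inconsistent'}")
```

The 2008 OR counts fail the check and the 2009 counts pass, which matches the observation that the logic appears to have been fixed. The date-range counts still fail: the narrower range (1960-2008) reports more records than the wider one (1435-2008).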

Possible blog entries

There are two possible blog entries related to this class — you can write one, both or neither of these. But I would find these interesting.

  1. Write a blog entry on what you observed, what you learned and found interesting, focusing on information that other students might find useful.
  2. Go talk to a Ross librarian. Tell them your topic and ask which 3 to 5 databases or tools you might find most useful given that topic. Use them for a while. By the end of the semester, write a blog entry describing how the information you find in these databases differs from what you would find on the Web at large or what you found with the Deep Web search tools we were introduced to above.

BTW, I would find it rather remarkable if your term project didn't have a section or group of resources covering information a person could get from a library's databases (compared with the Deep Web and the Web itself).

Resources

Research tools

Primary

The following sites are traditional Deep Web search sites. Each one takes a different approach to accessing documents in the Deep Web, so they're each worth trying.

  1. Google Scholar
  2. Scirus
  3. IncyWincy — the invisible Web search engine
  4. DeepDyve

Library- and book-based

Each of these tools provides a different way of accessing information in books. Lots of resources are being thrown at Google Books so we should definitely keep our eyes on it as more books enter the system.

  1. University of Michigan Library
  2. Google Books
  3. Amazon Advanced Book Search — Yes, I am including Amazon, the book seller, on this list.
  4. WorldCat

Specialized Deep Web search

Each of these is a deep web search engine but the underlying document sets are specialized.

  1. Green Info Online
    • Review on Peter's Reference Shelf
    • Be sure to look under "Search Options", "Advanced Search", and "Visual Search"
    • At the top of the screen, be sure to look at "Publications" and "New Features!"
    • "GreenFILE offers well-researched information covering all aspects of human impact to the environment. Its collection of scholarly, government and general-interest titles includes content on the environmental effects of individuals, corporations and local/national governments, and what can be done at each level to minimize these effects. Multidisciplinary by nature, GreenFILE draws on the connections between the environment and a variety of disciplines such as agriculture, education, law, health and technology. Topics covered include global climate change, green building, pollution, sustainable agriculture, renewable energy, recycling, and more. The database provides indexing and abstracts for approximately 384,000 records, as well as Open Access full text for more than 4,700 records."
  2. BNet — management, strategy, work life skills & advice for professionals. This is more of a collection of useful business-related information but I couldn't figure out where else in this course to let you know about it. So here it is.
  3. Biznar — deep web business search
  4. Mednar — deep web medical search
  5. ScienceResearch.com — "the world's science all in one place"
  6. Science.gov

General reference and answers

Each of these sites provides access to sets of facts and answers to questions. The first is a computational knowledge engine and the other sites have well-organized sets of traditional articles and entries about specific topics.

  1. Wolfram|Alpha
  2. Information Please Almanac
  3. Encyclopedia.com
  4. Britannica
  5. Wikipedia

Secondary deep web sites

These are worth peeking at if you need some more information. Each one of these provides reliable resources.

  1. InfoMine (UCal, Riverside)
    • Isn't being updated any more but still seems useful
  2. Directory of Open Access Journals — 1673 (1262 in 2008) journals are searchable at the article level, 319,861 (211,294 in 2008) articles.
  3. Bing

Web directories

The purpose of each of these sites is to provide an organized and categorized set of Web sites that have been evaluated for usefulness. Each one is worth a look to see if you might get lucky.

  1. Yahoo Directory
  2. Google Directory
  3. Intute — "Helping you find the best websites for study and research"
  4. Librarian's Internet Index
    • Overview: describes who they are, what they do, and what you might expect to get from looking at their site.
  5. Internet Public Library

Pay sites

Each one of these sites is quite useful but they require you to pay so I'm guessing you are out of luck; however, when you get out to the working world remember that these exist. You might be able to gain access to them through your employer.

  1. Web of Science
  2. Scopus
  3. OECD Factbook 2009

In development

I have these listed here just so that I can remember to look at them in future years to see if they have evolved into something more useful than their current condition.

  1. Q-Sensei
    • Includes the Library of Congress (I believe).
  2. DeepPeep
    • About
    • "DeepPeep is a search engine specialized in Web forms. The current beta version tracks 13,000 forms across 7 domains."

Dead

Each of these was once a viable deep web search engine, but they are now either not worth investigating or no longer exist in any form.

  1. CompletePlanet — 70K databases (but appears to be dead as of 2004!)
  2. Turbo10
  3. Microsoft Live Search Academic — closed down in May 2008.
  4. OAIster: find the pearls
    • Integrated into WorldCat in October 2009

Articles

  1. Exploring a 'Deep Web' that Google can't grasp, NYTimes, February 22, 2009
  2. Accessing the Deep Web
  3. Exploring the academic invisible Web
  4. Google Scholar revisited, by Péter Jacsó, Online Information Review, 32:1, 2008, pp. 102-114.
  5. The Deep Web: Surfacing hidden value
    • As summarized by the editor of The Journal of Electronic Publishing: "Michael K. Bergman, whose BrightPlanet company offers a new approach to search engines, examines the wealth of information that is available only on dynamically created Web sites, those that don't exist except as relational databases until someone seeks information from them. As more sites adopt the dynamic approach to pages, they are creating a challenge for standard search engines. This article looks at some alternatives."
  6. Search engine technology and digital libraries: Libraries need to discover the academic internet
  7. Google Scholar -- a new data source for citation analysis, by Anne-Wil Harzing, February 5, 2008 (7th version).

E-books

  1. Google Book Search
  2. Project Gutenberg
  3. American Memory (by the U.S. Library of Congress)
  4. Million Book Project
  5. Google Electronic Text Archives

Other

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License