Kosmix, Google, and the Deep Web

By kfreels

An interesting article posted last week on AltSearchEngines discusses the "love triangle" between Kosmix, Google, and the Deep Web. At the SDForum Search SIG in Silicon Valley, Alon Halevy of Google Labs and Anand Rajaraman of Kosmix discussed their respective companies' approaches to the Deep Web. I found the article intriguing since I did my Search Engine Analysis on Kosmix, a relatively new search engine that some are already calling a future rival of Google.

First off, what is Kosmix?

Kosmix, the company, was founded in 2005 by two talented programmers who sought to "tell {users} more about something". Its founders call the search engine a "guide to the Web - a place to start when you want to browse and discover everything that the Web has to offer". Their strategy operates on a single premise: in a general search engine like Google, how can you get the best results if you don't know exactly what you're looking for? Web users can go to Kosmix when they want to learn what else is out there about a topic.

Second, what is the Deep Web?

As we learned in this BIT 330 lecture, the Deep Web, or Invisible Web, comprises all the sites that general search engines don't index, for various reasons. The AltSearchEngines article describes the Deep Web this way:

"The Deep Web is simply the Web behind HTML forms. If you want to buy a car, for example, you might visit Cars.com and search for a used Toyota Prius, priced at less than $15,000 and located near Palo Alto, California. Cars.com will turn your query into an HTML page to present the results to you. A search engine won’t be able to see the page, however, because it was created just for you from a series of databases. The page becomes “lost” in the Deep Web."

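To make that concrete, here is a minimal Python sketch of the kind of form submission the quote describes. The URL and parameter names are invented for illustration; the point is that the results page exists only as a response to one specific query.

import requests

# Hypothetical form parameters standing in for the Cars.com example;
# the real site's parameter names are surely different.
params = {
    "make": "Toyota",
    "model": "Prius",
    "max_price": 15000,
    "zip": "94301",  # near Palo Alto, California
}

# The site builds the results page on the fly from its databases,
# just for this query.
response = requests.get("https://www.cars.com/search", params=params)

# No static link ever points at this generated page, so a crawler
# that only follows links will never see it - it is "lost" in the
# Deep Web.
print(response.status_code)
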
Kosmix's Approach to the Deep Web

Kosmix takes a federated approach to the Deep Web. For any search query, Kosmix queries the relevant HTML forms (like the Cars.com one) in real time, evaluates the results, and organizes them. Kosmix has built an intelligent system, based on its "categorization" technology, that knows which Web services to query for each search.

"Over the past three years, Kosmix has created a taxonomy of several million nodes, which we organized into a graph, using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to taxonomy nodes in a semi-automated fashion. Algorithms route the user’s query through the set of taxonomy nodes, which enable the engine to decide which Web service to call." (AltSearchEngines)

Don't worry, I didn't understand what that meant at first either. But I've played around with Kosmix, and here's what I've deduced it to mean: Kosmix has scoured the Web and built up information on nearly 5 million "categories". Somewhere in their code, they have written algorithms so that the site can respond when a user enters a query. In my simplified version, I imagine they've got a spreadsheet with 5 million rows, one for each category, followed by a column that identifies and tags the category, and another column that lists all the places they can find more information. My pathetic attempt to illustrate a simplified version follows:

Category: Pumpkin Pie
What we know: Pumpkin Pie is a food/dessert. Useful knowledge about pumpkin pie includes recipes, nutritional info, "how-to" videos, shopping, etc.
Where we can find more info: New recipes from the Food Network, "How To" baking videos from 5min.com, real-time tweets about pumpkin pie from Twitter, and caloric information from diet sites like FatSecret, etc.
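
For fun, here is that spreadsheet analogy as a toy Python sketch. The category, tags, and services are the ones from my example above; everything else about how Kosmix actually routes queries is my guess.

# A toy version of the "spreadsheet" above: one row per category,
# with tags and the Web services to call for more information.
taxonomy = {
    "pumpkin pie": {
        "tags": ["food", "dessert"],
        "services": [
            "Food Network (new recipes)",
            "5min.com (how-to baking videos)",
            "Twitter (real-time tweets)",
            "FatSecret (caloric information)",
        ],
    },
    # ...imagine nearly 5 million more rows like this one...
}

def route_query(query):
    # Match the user's query to a taxonomy node, then return the
    # services Kosmix would call in real time for that category.
    node = taxonomy.get(query.lower())
    return node["services"] if node else []

print(route_query("Pumpkin Pie"))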

Google's approach to the Deep Web, however, relies on a different, "less-is-more" strategy. The technical details are hard to follow, but basically, Google finds HTML forms, sends input to them from well-defined lists of likely values, and indexes the resulting HTML pages. Because the results are indexed ahead of time, the pages a user sees aren't generated fresh or unique to their query.
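
As I understand it (and this is just my reading of the article, with invented names throughout), the idea looks something like this Python sketch:

index = {}  # stands in for Google's regular search index

def fetch_results(form_url, value):
    # Stand-in for actually submitting the form over HTTP and
    # getting back the generated HTML page.
    return f"results page for {value} from {form_url}"

def surface_form(form_url, candidate_inputs):
    # Offline, long before any user searches: submit each candidate
    # input to the form and file the resulting page in the index.
    for value in candidate_inputs:
        index[f"{form_url}?q={value}"] = fetch_results(form_url, value)

surface_form("https://www.cars.com/search", ["Toyota Prius", "Honda Civic"])

# At query time, Google just serves these pre-indexed pages, so what
# a user sees was not generated fresh for their particular query.
print(list(index.keys()))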

My take on the "love triangle":

The verdict is still out in the search community on which approach is better, but it is clear that both Google's and Kosmix's Deep Web approaches fit within their overall search philosophies. Still, I'm impressed with Kosmix's thorough approach, which tailors results to user-generated queries. Their approach "taps into these {HTML} forms in real-time", so the information on category pages is the most up-to-date and relevant. Real-time, up-to-date content is becoming increasingly important in our Web 2.0 world. People want to know what's going on now, and they want it to be accurate. I feel that Kosmix might have a leg up in its quest to tackle the Deep Web problem for that very reason.
