RSS Search analysis

by Scott Moore (samoore in BIT330, Fall 2009)

Summary data

The following table contains the precision of the top 10 results returned for the queries and the number of overlapping documents shared between the top 10 results of different search engines.

RSS search (comparing top 10 results)

| | Bloglines | Google Blog Search | Ice Rocket | Technorati |
|---|---|---|---|---|
| Bloglines | 4.5 | 1.0 | 0.7 | 0.6 |
| Google Blog Search | | 5.5 | 0.6 | 0.5 |
| Ice Rocket | | | 3.6 | 0.5 |
| Technorati | | | | 2.6 |

All: 0.1

Let's make sure that we know what this table tells us.

Diagonal values
Consider the cell that contains "5.5". This tells us that, on average, the top 10 results returned by Google Blog Search contain 5.5 relevant documents.
Off-diagonal values
Consider the cell that contains "0.7". This says that, on average, the top 10 results returned by Bloglines contained 0.7 documents that also appeared in the top 10 results of Ice Rocket.
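As a minimal sketch of how one diagonal cell and one off-diagonal cell could be computed, consider the following. The function names, document identifiers, and relevance judgments are all hypothetical; they are not the class data.

```python
def relevant_count(results, relevant):
    """Count how many of the top 10 results were judged relevant."""
    return sum(1 for doc in results[:10] if doc in relevant)


def overlap_count(results_a, results_b):
    """Count the documents that appear in both top-10 lists."""
    return len(set(results_a[:10]) & set(results_b[:10]))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical results for two queries (one per student).
queries = [
    {
        "bloglines": ["a", "b", "c", "d"],
        "icerocket": ["c", "x", "y"],
        "relevant": {"a", "c", "x"},
    },
    {
        "bloglines": ["m", "n", "o"],
        "icerocket": ["p", "q"],
        "relevant": {"m", "p", "q"},
    },
]

# Diagonal cell: average number of relevant documents for one engine.
diag_bloglines = mean(
    relevant_count(q["bloglines"], q["relevant"]) for q in queries
)

# Off-diagonal cell: average number of documents shared by two engines.
off_diag = mean(
    overlap_count(q["bloglines"], q["icerocket"]) for q in queries
)

print(diag_bloglines, off_diag)
```

Each cell in the table is an average over all students' queries, which is why the values are not whole numbers.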

The apparent result is that about 5/9 of the results for Google Blog Search are relevant, just over 3/7 of the results for Bloglines are relevant, just over 1/3 of the results for Ice Rocket are relevant, and around 1/4 of the results for Technorati are relevant. The standard deviations of this year's precision values range from 2.1 to 2.9. Last year's results were fairly similar: 53% for Google Blog Search, 44% for Bloglines, and 33% for Technorati.

Now let's consider the overlap data. There's very little overlap between pairs of RSS search engines: on average, there are 0.65 documents in common between any pair of RSS search engines. This value is nearly identical to last year's overlap rate.
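The fractions and the average overlap quoted above follow directly from the summary table. A short check, using only the values in the table (the variable names are mine):

```python
# Diagonal values: average number of relevant documents in the top 10.
mean_relevant = {
    "Google Blog Search": 5.5,
    "Bloglines": 4.5,
    "Ice Rocket": 3.6,
    "Technorati": 2.6,
}

# Off-diagonal values: average number of shared documents per pair.
overlap = {
    ("Bloglines", "Google Blog Search"): 1.0,
    ("Bloglines", "Ice Rocket"): 0.7,
    ("Bloglines", "Technorati"): 0.6,
    ("Google Blog Search", "Ice Rocket"): 0.6,
    ("Google Blog Search", "Technorati"): 0.5,
    ("Ice Rocket", "Technorati"): 0.5,
}

# Precision is the relevant count divided by the 10 results examined.
precision = {engine: count / 10 for engine, count in mean_relevant.items()}

# Average overlap across the six pairs of engines.
average_overlap = sum(overlap.values()) / len(overlap)

for engine, value in precision.items():
    print(f"{engine}: {value:.0%}")
print(f"average overlap: {average_overlap:.2f}")
```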

Results

Explanation of statistics

For the individual results, I show for how many students the precision was better for the first search engine, better for the second search engine, or the same for the two search engines. For the Student's paired t, I test the hypothesis that the mean difference in precision between the two search engines is equal to zero; this test assumes that the differences are normally distributed. I used this table of values to test the hypotheses. For the Wilcoxon signed rank test, I am testing the hypothesis that the precisions for the two search engines are drawn from the same distribution (no matter what that distribution might be). I used the method described on this page to calculate this statistic.
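The three comparisons described above can be sketched in plain Python. This is an illustration under stated assumptions, not the calculation used for the results on this page: the two precision lists are made up, the function names are hypothetical, and the Wilcoxon part assumes the convention in which W is the sum of the signed ranks, with a continuity-corrected normal approximation for z.

```python
import math
from statistics import mean, stdev


def tally(a, b):
    """Count students for whom engine a was better, b was better, or tied."""
    first = sum(1 for x, y in zip(a, b) if x > y)
    second = sum(1 for x, y in zip(a, b) if x < y)
    same = sum(1 for x, y in zip(a, b) if x == y)
    return first, second, same


def paired_t(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def wilcoxon_signed_rank(a, b):
    """Return (W, N, z), where W is the sum of signed ranks (assumed convention)."""
    # Tied pairs (zero difference) are dropped, so N counts non-ties only.
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    n = len(diffs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Give equal absolute differences the average of their ranks.
        j = i
        while j + 1 < n and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    # Continuity correction of 0.5 toward zero.
    z = 0.0 if w == 0 else math.copysign(abs(w) - 0.5, w) / sigma
    return w, n, z


# Hypothetical relevant-document counts for eight students.
engine_a = [6, 5, 7, 3, 8, 4, 6, 5]
engine_b = [4, 5, 3, 4, 5, 2, 6, 1]

counts = tally(engine_a, engine_b)
t_stat, dof = paired_t(engine_a, engine_b)
w_stat, n_pairs, z_stat = wilcoxon_signed_rank(engine_a, engine_b)

print(counts, round(t_stat, 2), dof, w_stat, n_pairs, round(z_stat, 2))
```

With only eight made-up pairs the normal approximation is rough; the sample is meant to show the mechanics only. Each statistic is then compared with a critical value from the relevant table.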

Differences in precision of the top 10 results

Bloglines vs. Google Blog Search: Test hypothesis that Google Blog Search is better than Bloglines. All tests support the hypothesis.

  • Individual results (B/G/=): 8/15/9
  • Student's paired t: $t_{g,b} = 1.99 > t_{30,95} = 1.697$
  • Wilcoxon: $W_{23} = 125 \Rightarrow z = 11.28 > z_{99.95} = 3.291$

Bloglines vs. Ice Rocket: Test hypothesis that Bloglines is better than Ice Rocket. All tests support the hypothesis.

  • Individual results (B/I/=): 18/10/4
  • Student's paired t: $t_{b,i} = 1.80 > t_{30,95} = 1.697$
  • Wilcoxon: $W_{28} = 155 \Rightarrow z = 10.38 > z_{99.95} = 3.291$

Bloglines vs. Technorati: Test hypothesis that Bloglines is better than Technorati. All tests support the hypothesis.

  • Individual results (B/T/=): 21/6/5
  • Student's paired t: $t_{b,t} = 3.63 > t_{30,99.75} = 3.030$
  • Wilcoxon: $W_{27} = 269 \Rightarrow z = 22.04 > z_{99.95} = 3.291$

Google Blog Search vs. Ice Rocket: Test hypothesis that Google Blog Search is better than Ice Rocket. All tests support the hypothesis.

  • Individual results (G/I/=): 24/8/0
  • Student's paired t: $t_{g,i} = 3.66 > t_{30,99.75} = 3.030$
  • Wilcoxon: $W_{32} = 321 \Rightarrow z = 19.93 > z_{99.95} = 3.291$

Google Blog Search vs. Technorati: Test hypothesis that Google Blog Search is better than Technorati. All tests support the hypothesis.

  • Individual results (G/T/=): 26/4/2
  • Student's paired t: $t_{g,t} = 5.51 > t_{30,99.75} = 3.030$
  • Wilcoxon: $W_{30} = 376 \Rightarrow z = 29.69 > z_{99.95} = 3.291$

Ice Rocket vs. Technorati: Test hypothesis that Ice Rocket is better than Technorati. All tests support the hypothesis.

  • Individual results (I/T/=): 17/9/6
  • Student's paired t: $t_{i,t} = 1.97 > t_{30,90.0} = 1.310$
  • Wilcoxon: $W_{26} = 118 \Rightarrow z = 8.68 > z_{99.95} = 3.291$

Discussion

So, what we have is G > B > I > T with strong levels of significance. Google Blog Search is clearly better than Bloglines and the rest. There really isn't any debate on this point. I would use Google Blog Search for all my searches and, for my most important searches, I would also run the query against Bloglines. Running the query in this second search engine would generally be beneficial because of the low level of overlap between search engines. Beyond that, unless you have some specific reason to do so, I would not recommend that you spend time using Ice Rocket or Technorati for serious research projects.
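As a quick consistency check on the ordering, the individual results reported above can be tallied by pairwise wins. Counting pairwise wins is my own illustration, not one of the tests used on this page; the numbers are copied from the individual results for each pair.

```python
# (first engine, second engine, first better, second better, same)
tallies = [
    ("B", "G", 8, 15, 9),
    ("B", "I", 18, 10, 4),
    ("B", "T", 21, 6, 5),
    ("G", "I", 24, 8, 0),
    ("G", "T", 26, 4, 2),
    ("I", "T", 17, 9, 6),
]

# Count how many head-to-head comparisons each engine wins.
pairwise_wins = {engine: 0 for engine in "GBIT"}
for first, second, first_better, second_better, _same in tallies:
    if first_better > second_better:
        pairwise_wins[first] += 1
    elif second_better > first_better:
        pairwise_wins[second] += 1

# Sort engines from most to fewest pairwise wins.
ranking = sorted(pairwise_wins, key=pairwise_wins.get, reverse=True)
print(" > ".join(ranking))
```

Every pair has a clear winner and no engine loses to one ranked below it, so the pairwise tallies are consistent with a single ordering.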

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License