Search Tool Data Analysis

By samoore

by Brian Hendricks (BrianHeM10 in BIT330, Fall 2008)

Questions and queries

Web search engines

Software-as-a-Service (SaaS) applications are currently the fastest growing software sector and one of the fastest growing industries in the world. I am interested in the specific reasons why this is the case. I used the web search engines to answer: "What are some of the biggest factors causing the enormous growth rates in the SaaS industry?"

For Google, Yahoo, and Windows Live, I used the query:

"software as a service" growth factors

Blog search engines

One of the most important elements in the future of the SaaS industry is the idea behind cloud computing and hosting applications "in the cloud". Currently, Amazon's Elastic Compute Cloud (EC2) is opening up new frontiers for cloud computing. I used the blog search engines to find blog posts about: "What exactly is EC2?"

For Google Blog Search, Technorati, and Bloglines, I used the query:

Amazon EC2

Data that I collected

Search engine overlap data

Web search    Live    Google    Yahoo Web
Live          70      20        10
Google        -       75        15
Yahoo Web     -       -         70
All           10

Blog search   Technorati    Google Blog    Bloglines
Technorati    65            0              5
Google Blog   -             90             0
Bloglines     -             -              45
All           0

Search engine ranking overlap data

This table provides a measure of how many of Google's responses are reproduced by Yahoo.
GY (rows: top N Google results; columns: top N Yahoo results)
Google \ Yahoo    5     10    20
5                 0     0     0
10                1     1     1
20                1     2     3

This table provides a measure of how many of Yahoo's responses are reproduced by Google.
YG (rows: top N Yahoo results; columns: top N Google results)
Yahoo \ Google    5     10    20
5                 0     1     1
10                0     1     1
20                0     1     3

This table provides a measure of how many of Bloglines' responses are reproduced by Google Blog Search.
BG (rows: top N Bloglines results; columns: top N Google Blog Search results)
Bloglines \ Google    5     10    20
5                     0     0     0
10                    0     0     0
20                    0     0     0

This table provides a measure of how many of Google Blog Search's responses are reproduced by Bloglines.
GB (rows: top N Google Blog Search results; columns: top N Bloglines results)
GBlog \ Bloglines    5     10    20
5                    0     0     0
10                   0     0     0
20                   0     0     0

Results

Web search

Summary Statistics

Precision (%)   Live     Google   Yahoo Web
Min             10       20       10
Median          42.5     57.5     52.5
Mean            42.8     54.4     51.7
Mode            15       70       70
Max             80       90       85

Results Overlap (%)   L/G     L/Y     G/Y     L/G/Y
Min                   0       5       5       0
Median                20      20      20      10
Mean                  18.3    20.0    20.6    10.0
Mode                  10      10      25      10
Max                   35      45      35      25

The statistics above summarize the precision of results and the overlap of results between search engines. Precision measures how well a search engine returned relevant results: it is the number of relevant results divided by the number of results examined. Results overlap is the percentage of examined results that appeared in every engine of the compared set, where the engines are Live (L), Google (G), and Yahoo (Y). For example, Google's mean precision was 54.4%, and on average 20% of the results examined appeared in both Live and Yahoo.
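As a rough illustration, the two measures defined above can be sketched in Python. The function names, the relevance judgments, and the letter "URLs" are all hypothetical; this is not the spreadsheet or tool used for the original analysis, and it assumes every engine contributed the same number of examined results.

```python
def precision(relevance_judgments):
    """Percentage of examined results that were judged relevant."""
    return 100.0 * sum(relevance_judgments) / len(relevance_judgments)


def results_overlap(*result_lists):
    """Percentage of examined results that appear in every engine's list.

    Assumes each engine contributed the same number of results, so the
    first list's length is used as the denominator.
    """
    shared = set(result_lists[0]).intersection(*result_lists[1:])
    return 100.0 * len(shared) / len(result_lists[0])


# Hypothetical judgments for 20 results: True means "relevant".
judgments = [True] * 11 + [False] * 9
print(precision(judgments))  # 55.0

# Hypothetical result URLs (shortened to letters) for three engines.
live = ["a", "b", "c", "d"]
google = ["b", "c", "x", "y"]
yahoo = ["c", "z", "q", "w"]
print(results_overlap(live, google))         # 50.0
print(results_overlap(live, google, yahoo))  # 25.0
```

Treating each result as a URL in a set makes the overlap independent of rank, which matches the results overlap columns (L/G, L/Y, G/Y, L/G/Y) rather than the ranking overlap tables.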

Ranking Overlap (G/Y)   o(5,5)   o(10,5)   o(20,5)   o(5,10)   o(10,10)   o(20,10)   o(5,20)   o(10,20)   o(20,20)
Min                     0        0         0         0         0          0          0         0          0
Median                  1        1         2         1         2          3          1         3          4
Mean                    1.1      1.4       1.6       1.3       2.0        2.6        1.6       2.5        3.7
Mode                    1        0         0         1         1          4          1         3          5
Max                     4        4         4         4         4          5          4         5          7

Ranking Overlap (Y/G)   o(5,5)   o(10,5)   o(20,5)   o(5,10)   o(10,10)   o(20,10)   o(5,20)   o(10,20)   o(20,20)
Min                     0        0         0         0         0          0          0         0          0
Median                  1        1         1         1         2          3          2         3          4
Mean                    1.1      1.2       1.6       1.5       1.9        2.5        1.9       2.6        3.8
Mode                    1        0         1         1         3          3          1         4          5
Max                     4        4         4         4         4          5          4         5          7

The statistics above describe the ranking overlap between Google and Yahoo. Ranking overlap counts how many of the first 5, 10, or 20 results of one search engine appear in the first 5, 10, or 20 results of the other search engine. For example, o(10,5) in the GY table is the number of top 10 Google results that appear in the top 5 Yahoo results, and o(5,20) in the GY table is the number of top 5 Google results that appear in the top 20 Yahoo results.
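The o(m,n) measure can be sketched as a set intersection over the two ranked lists. This is only an illustration under stated assumptions: the function names are hypothetical, results are matched by exact URL, and the two result lists below are made up rather than taken from the collected data.

```python
def ranking_overlap(first, second, m, n):
    """o(m, n): how many of the top m results of `first` are in the top n of `second`."""
    return len(set(first[:m]) & set(second[:n]))


def overlap_table(first, second, cutoffs=(5, 10, 20)):
    """All nine o(m, n) values, keyed by (m, n)."""
    return {(m, n): ranking_overlap(first, second, m, n)
            for n in cutoffs for m in cutoffs}


# Hypothetical ranked lists of 20 URLs each (u1..u20 and y1..y20).
google = [f"u{i}" for i in range(1, 21)]
yahoo = [f"y{i}" for i in range(1, 21)]
yahoo[1] = "u3"    # Google's #3 result is Yahoo's #2
yahoo[7] = "u12"   # Google's #12 result is Yahoo's #8
yahoo[14] = "u1"   # Google's #1 result is Yahoo's #15

gy = overlap_table(google, yahoo)
yg = overlap_table(yahoo, google)
print(gy[(10, 5)])   # 1: only u3 is in Google's top 10 and Yahoo's top 5
print(gy[(5, 20)])   # 2: u1 and u3 are in Google's top 5 and Yahoo's top 20
print(gy[(20, 20)])  # 3
```

Under this definition, o(m,n) in the G/Y direction always equals o(n,m) in the Y/G direction, which is a handy consistency check when filling in the two tables by hand.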

Hypothesis Test - Is Google More Precise Than Live & Yahoo?

Null Hypothesis: Google(Precision) = Live(Precision), Alternative Hypothesis: Google(Precision) > Live (Precision)
Alpha: 5%
Sample Mean(Google): 54.4, Sample Mean(Live): 42.8, Std(Google): 20.1, Std(Live): 22.8
T-Statistic: 1.5266
P-Value: .0688
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to conclude that Google's search results are more precise than Live's search results.

Null Hypothesis: Google(Precision) = Yahoo(Precision), Alternative Hypothesis: Google(Precision) > Yahoo(Precision)
Alpha: 5%
Sample Mean(Google): 54.4, Sample Mean(Yahoo): 51.7, Std(Google): 20.1, Std(Yahoo): 22.4
T-Statistic: .3588
P-Value: .3611
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to conclude that Google's search results are more precise than Yahoo's search results.

The tests above are two-sample hypothesis tests of means. Each test reports a p-value: the probability of observing a difference at least as large as the one in the sample if the null hypothesis were true. Alpha is the significance level; the null hypothesis is rejected only when the p-value falls below alpha.
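The t-statistics above can be approximately reproduced from the summary statistics alone. The sketch below rests on two assumptions that the report does not state outright: that each engine was measured on 16 queries (the sample size mentioned under Methodological Changes) and that the standard error uses the unpooled form, which gives the same t-statistic as the pooled form when the two samples are the same size. The function name is hypothetical.

```python
import math


def two_sample_t(mean_a, std_a, mean_b, std_b, n_a, n_b):
    """t-statistic for the difference of two sample means (unpooled standard error)."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


N = 16  # assumed number of queries per engine

t_google_live = two_sample_t(54.4, 20.1, 42.8, 22.8, N, N)
t_google_yahoo = two_sample_t(54.4, 20.1, 51.7, 22.4, N, N)

print(round(t_google_live, 2))   # about 1.53
print(round(t_google_yahoo, 2))  # about 0.36
```

Both values land within rounding of the reported 1.5266 and .3588. Turning a t-statistic into a one-sided p-value requires the t-distribution; if SciPy is available, scipy.stats.ttest_ind_from_stats accepts the same summary statistics and returns a p-value as well.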

Blog search

Summary Statistics

Precision (%)   Technorati   Google Blog Search   Bloglines
Min             5            25                   20
Median          30           42.5                 47.5
Mean            33.1         52.5                 44.4
Mode            35           40                   50
Max             85           100                  75

Results Overlap (%)   T/G     T/B     G/B     T/G/B
Min                   0       0       0       0
Median                0       7.5     5       0
Mean                  3.6     9.2     6.9     1.4
Mode                  0       5       5       0
Max                   25      25      20      10

The above statistics represent general statistics on the precision of results and the overlap of results between blog search engines. Please refer to the "Web Search" statistics for more information on the definitions of precision and overlap of results.

Ranking Overlap (G/B)   o(5,5)   o(10,5)   o(20,5)   o(5,10)   o(10,10)   o(20,10)   o(5,20)   o(10,20)   o(20,20)
Min                     0        0         0         0         0          0          0         0          0
Median                  0        0         0         0         0          0          0         0          1
Mean                    0.3      0.4       0.5       0.4       0.5        0.8        0.7       0.8        1.1
Mode                    0        0         0         0         0          0          0         0          0
Max                     1        2         2         2         2          3          3         4          4

Ranking Overlap (B/G)   o(5,5)   o(10,5)   o(20,5)   o(5,10)   o(10,10)   o(20,10)   o(5,20)   o(10,20)   o(20,20)
Min                     0        0         0         0         0          0          0         0          0
Median                  0        0         0         0         0          1          0         1          1
Mean                    0.3      0.4       0.6       0.4       0.5        0.8        0.5       0.9        1.1
Mode                    0        0         0         0         0          0          0         0          1
Max                     1        2         3         2         2          4          2         3          4

The above statistics refer to the rankings overlap between Google Blog Search (G) and Bloglines (B). Please refer to the "Web Search" statistics for more information on the definitions of ranking overlap and how to read the table.

Hypothesis Test - Is Google Blog Search More Precise Than Technorati & Bloglines?

Null Hypothesis: Google-Blog(Precision) = Technorati(Precision), Alternative Hypothesis: Google-Blog(Precision) > Technorati(Precision)
Alpha: 5%
Sample Mean(G-Blog): 52.5, Sample Mean(Technorati): 33.1, Std(G-Blog): 22.2, Std(Technorati): 21.2
T-Statistic: 2.528
P-Value: .0085
Decision: Reject the Null Hypothesis
Conclusion: At the 5% significance level, there is sufficient evidence to suggest Google Blog Search produces more precise results than Technorati.

Null Hypothesis: Google-Blog(Precision) = Bloglines(Precision), Alternative Hypothesis: Google-Blog(Precision) > Bloglines(Precision)
Alpha: 5%
Sample Mean(G-Blog): 52.5, Sample Mean(Bloglines): 44.4, Std(G-Blog): 22.2, Std(Bloglines): 14.3
T-Statistic: 1.2269
P-Value: .1155
Decision: Fail to reject the Null Hypothesis
Conclusion: At the 5% significance level, there is not enough evidence to suggest Google Blog Search produces more precise results than Bloglines.

The tests above are two-sample hypothesis tests of means. The p-value is the probability of observing a difference at least as large as the one in the sample if the null hypothesis were true, and alpha is the significance level below which the p-value must fall for the null hypothesis to be rejected.

Discussion

Web search

Comments

The Web Search overlap results are not especially striking, as the numbers are relatively similar across the board. It is interesting, though, that the mean results overlap between Google and Yahoo (20.6%) and the mean o(20,20) ranking overlap (3.7/3.8) are slightly off from each other. It was also intriguing to see that the most common overlap between Google and Yahoo (25%) was 15 percentage points higher than the most common overlap for Live/Google and Live/Yahoo (10% each). This means that there may be a strong relationship between Google's and Yahoo's search methods and site indexes.
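The gap between the mean results overlap and the o(20,20) counts can be checked with quick arithmetic. The sketch below assumes that the overlap percentages were computed over 20 examined results per engine, which the report suggests (values move in steps of 5%) but does not state outright; the function name is hypothetical.

```python
RESULTS_EXAMINED = 20  # assumption: 20 results examined per engine


def overlap_count(percent, examined=RESULTS_EXAMINED):
    """Convert an overlap percentage into a number of shared results."""
    return percent / 100 * examined


# Mean G/Y results overlap from the summary statistics.
expected_shared = overlap_count(20.6)
print(round(expected_shared, 2))  # 4.12

# Mean o(20,20) from the two ranking overlap tables.
for reported in (3.7, 3.8):
    print(round(expected_shared - reported, 2))  # gaps of 0.42 and 0.32
```

Under that assumption, the percentage implies about 4.1 shared results against a reported 3.7 to 3.8, which is the "slightly off" difference noted above.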

Recommendations

First, it is concerning that Live's precision was consistently lower than Google's and Yahoo's, and that its mean appears to have been driven up by a few large values (its mode was only 15%). Be wary of Live's ability to return precise results; nevertheless, the hypothesis test did not conclude that Live is less precise than Google. Since the overlap between these sites was low (~20%), it is still worth the extra time to use all three. The more detailed your query is, the more likely you are to get similar results across the three search engines.

Key Learnings

I was very shocked to see that the highest overlap between Google and Yahoo was 7 out of 20. My impression was that their results are pretty much the same, with only slight ranking differences between the two. Also, it was surprising to see Live produce even more dissimilar results and have a consistently lower overlap with Google and Yahoo. I certainly learned to pay closer attention to the results of Google and Yahoo and to consider using both when searching.

Possible Further Investigations

Based on the data, there are five additional investigations I would like to complete:

  • How does Google compare to specialized search engines?
  • What is the ranking overlap for the top 50?
  • Does the overlap always grow with each additional 5/10 results?
  • Is there always a strong correlation between the average overlap percentage and the o(20,20) count? (Assuming we look at more than 20 results for overlap)
  • Do Google and Yahoo consistently have a higher overlap than Live/Yahoo and Live/Google?

Blog search

Comments

The most striking implication of all the data collected is how unrelated the results are across blog search engines. Although all three blog search engines were relatively precise, they had barely any overlap with each other. As a result, the ranking overlap was almost nonexistent. It is interesting to note that the mean overlap for G/B was 6.9% and the highest mean ranking overlap was 1.1 results, a roughly consistent ratio. However, this also implies that in most circumstances overlap between the blog search engines occurs outside of the top 10, a region users rarely reach. Either each blog search engine's algorithm or its data index must be quite different for only 1.4% of results, on average, to appear on all three of Technorati, Google Blog Search, and Bloglines.

Recommendations

The data suggests that when searching for blog posts it is beneficial to try multiple sources, as no single search engine is highly precise and the search engines do not return similar results. However, as the hypothesis test supports, Google Blog Search will generally produce more precise results than Technorati. It will also be beneficial to tailor your queries to each search engine's syntax options. For the blog search engines, we did not delve into each engine's advanced search options, which may have improved the precision and overlap results. Also, each blog search engine excelled at a particular need: Technorati was great at displaying news and consolidating information, Google Blog Search helped find relevant blog feeds, and Bloglines returned really strong individual blog posts.

Key Learnings

From this research, I discovered two new web tools (Technorati and Bloglines) and how difficult it can be to use blog search engines. It is very easy to type a query into these tools, but really finding new blogs to subscribe to or recent posts to read can be quite a challenge. This statement is supported by the lower precision and much lower overlap results of the blog search engines compared to the web search engines.

Possible Further Investigations

Based on the data, there are three additional investigations I would like to complete:

  • What is the ranking overlap between Technorati and Bloglines?
  • Are there any specific syntax changes that dramatically increase precision for each blog search engine?
  • Is the ranking overlap within blog search engines consistently that low?

Methodological Changes

The following changes to our research methodology affect both Web Search and Blog Search and can be used to create more realistic, accurate results:

  • Increase the sample size from 16 to at least 35. Once the sample size reaches 35, the sampling distribution of the mean can be treated as approximately normal.
  • Require that the queries for all three Web/Blog search engines are the same (adjusting for syntax differences when necessary). Some results may have been skewed by certain queries being very different.
  • Encourage more consistency between the types of queries that are entered. The results may have been affected by some queries being very specific (intitle:x inurl:y "Q" or "U") or too simple (football).
  • Establish a clear definition of "relevancy" when measuring precision and of "overlap". Some results may have the same title but come from a different source, which may cause confusion.