Web Search experiment

This is due by Sunday, 9/27, at 5pm (so that I can summarize the information for next class). This looks like a ton of work but it's actually not that bad — I was just really obsessive-compulsive with the directions. (I hope they end up being clear enough for you; if not, I hope you'll let me know sooner rather than later.) It will probably be easier for you to complete these instructions if you print them out at the beginning.

You will compare four Web search engines (Google, Bing, Yahoo!, and Ask) in a controlled, quantitative manner. The point of this experiment is for us, as a class, to gather enough information to draw some reasonable conclusions about which general Web search engine is best.

1. Defining the question and queries

  1. We need to define your "home" search engine for this experiment.
    • Get out your MCard.
    • Look at the last digit of your UMID.
    • This digit determines your "home" search engine:
      • 0,1: Ask
      • 2,3,4: Bing
      • 5,6: Google
      • 7,8,9: Yahoo
  2. Define a question that you want to have answered; you will submit a query reflecting this question to each of these Web search engines. It would be better (but in no way required) if the question is related to a possible term project topic.
    • Write more than half a line describing your question. Tell me what types of documents would be most useful to you and what types would be less useful.
    • Don't write a question that references a small, relatively unknown product or company; that makes your search too easy.
  3. Work in your "home" search engine refining and revising your query until it is returning reasonable (we're not shooting for perfect, just reasonable) results at the top of the results list. Some hints related to this query:
    • Don't work in the non-home search engines while revising your query. Only use your "home" search engine while revising it.
    • I would be surprised if your query consisted of only one word (rather than multiple words or phrases), or if it weren't sophisticated in some of the ways we have already learned about.
    • Be sure that the queries are as equivalent as possible across the four search engines, allowing only for differences in syntax (such as the different uses of "dash" or "NOT" in different engines).
    • In other words, the queries should be the same across all four search engines except where the search syntax of one engine differs from another. So, if you use quotes in one query, use quotes in the other queries, too; if you search in the title with one search engine, search the same way in the others. (A small illustration appears right after this list.)
    • When you are showing what query you used, don't include quotes in the query unless you used quotes in the query. (Hmmmm.) For example, suppose your search query was fastest car acceleration. Don't write “fastest car acceleration” unless you used quotes at the beginning and end of the query when you typed it into the search engine. If you simply queried using those three words, then a standard way of representing this is to start and end your query with square brackets; thus, you would write [fastest car acceleration].
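To make "equivalent queries" concrete, here is a small, purely hypothetical illustration written as a Python dictionary. It assumes (and this is only an assumption, not a statement about the real engines) that three of the engines exclude a term with a leading dash while the fourth uses NOT; check each engine's own help pages for the actual syntax before borrowing anything from this sketch.

    # Hypothetical "equivalent" queries: the same phrase, terms, and exclusion
    # everywhere, differing only where the engines' syntax differs.
    # The dash-vs-NOT detail below is an assumption, not verified syntax.
    queries = {
        "Google": '"fastest car" acceleration -motorcycle',
        "Bing":   '"fastest car" acceleration -motorcycle',
        "Yahoo":  '"fastest car" acceleration -motorcycle',
        "Ask":    '"fastest car" acceleration NOT motorcycle',  # only if Ask really used NOT
    }

The point is that all four entries express exactly the same information need; only the notation changes.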

2. Gathering your data

  1. Submit the query to each of the four search engines. Do all four searches within 30 minutes, from beginning to end. The reason for this requirement is that the search engines' databases change over time, and we want to give each engine an equal opportunity.
  2. Print out a list of the first 10 resources as returned by each search engine. You will be using these pages to do your analysis (see below). Change your question if an appropriate query doesn't return at least 10 results for all of the search engines.
  3. Before closing each results page, save the page to your file space. The reason for doing this is so that you can use the links embedded in the page later in the assignment; the printouts alone are often not enough to get back to a Web page, since some parts of the page don't always print.

3. Analyzing your data

Report on the results in the following way:

  1. For the first 10 resources returned by each search engine, determine which of the Web pages are applicable and useful. You should look at each Web page (not just its summary in the results list) in order to determine this. On the printout, write a P (for precision) in the right margin next to each Web page that you consider to be applicable and useful.
    • An indented result (that is, a sub-page of the site listed just above it) counts as one of these "resources".
    • If there is a group of images listed as a resource, then that group of images counts as one resource. You'll have to judge the whole group as to whether it's applicable and useful.
    • Same thing for news results — if there is a group of news results listed as a resource, then that group of news stories counts as one resource and needs to be judged as a whole. Are you glad those news results are listed or not? Did they help you find out the answer to your query?
  2. Now you need to summarize the results for each search engine.
    • Count up the number of times you wrote P on each printout. (For example, if you marked 7 of Ask's 10 results with a P, Ask's precision count is 7.)
    • Write down the results in rows [1] through [4] of the table below.
  3. Determine the overlap of the results returned by the different search engines; that is, we're going to count the number of times a result for one search engine (whether relevant or not) appears in the results of another search engine.
    • Ask/Bing
      • On the Ask printout, go through each resource, and if you find it on the Bing results, then put a B in the right margin next to the resource on the Ask printout.
      • Count up the number of times you wrote B and put this number on line [5].
    • Ask/Google
      • On the Ask printout, go through each resource, and if you find it on the Google results, then put a G in the right margin next to the resource on the Ask printout.
      • Count up the number of times you wrote G and put this number on line [6].
    • Ask/Yahoo
      • On the Ask printout, go through each resource, and if you find it on the Yahoo results, then put a Y in the right margin next to the resource on the Ask printout.
      • Count up the number of times you wrote Y and put this number on line [7].
    • Bing/Google
      • On the Bing printout, go through each resource, and if you find it on the Google results, then put a G in the right margin next to the resource on the Bing printout.
      • Count up the number of times you wrote G and put this number on line [8].
    • Bing/Yahoo
      • On the Bing printout, go through each resource, and if you find it on the Yahoo results, then put a Y in the right margin next to the resource on the Bing printout.
      • Count up the number of times you wrote Y and put this number on line [9].
    • Google/Yahoo
      • On the Google printout, go through each resource, and if you find it on the Yahoo results, then put a Y in the right margin next to the resource on the Google printout.
      • Count up the number of times you wrote Y and put this number on line [10].
  4. Determine the overlap of the results returned by all of the search engines; that is, we're going to count the number of times a result appeared in all of the search engines.
    • On the Ask printout, go through each resource and count up how many resources have a B, a G, and a Y next to them.
    • Put this number on line [11].
Data                      Your value
[1]  Ask precision        _____
[2]  Bing precision       _____
[3]  Google precision     _____
[4]  Yahoo precision      _____
[5]  Overlap(A,B)         _____
[6]  Overlap(A,G)         _____
[7]  Overlap(A,Y)         _____
[8]  Overlap(B,G)         _____
[9]  Overlap(B,Y)         _____
[10] Overlap(G,Y)         _____
[11] Overlap(A,B,G,Y)     _____

Again, these numbers are simple counts: P marks for the precision rows and shared results for the overlap rows. I will do all the calculations later.
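You will do these counts by hand on the printouts, but if you would like to double-check your arithmetic, here is a minimal Python sketch of the same tallies. The URLs and P marks in it are made-up placeholders (and each list is shortened to three entries instead of ten); substitute the actual results from your own printouts if you use it.

    from itertools import combinations

    # Top results for each engine (normally 10 per engine; shortened here).
    # These URLs are placeholders, not real data.
    results = {
        "Ask":    ["u1", "u2", "u3"],
        "Bing":   ["u2", "u4", "u5"],
        "Google": ["u2", "u3", "u6"],
        "Yahoo":  ["u2", "u5", "u7"],
    }

    # The URLs you marked with a P (applicable and useful) on each printout.
    marked_p = {
        "Ask":    {"u1", "u2"},
        "Bing":   {"u2"},
        "Google": {"u2", "u3", "u6"},
        "Yahoo":  {"u2"},
    }

    # Rows [1]-[4]: precision counts (number of P marks per printout).
    for engine in results:
        print(f"{engine} precision: {len(marked_p[engine])}")

    # Rows [5]-[10]: pairwise overlap, i.e. a URL appearing in both lists,
    # whether or not it was marked P.
    for a, b in combinations(results, 2):
        shared = set(results[a]) & set(results[b])
        print(f"Overlap({a[0]},{b[0]}): {len(shared)}")

    # Row [11]: overlap across all four engines.
    common = set.intersection(*(set(urls) for urls in results.values()))
    print(f"Overlap(A,B,G,Y): {len(common)}")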

4. How to wrap it up

  1. You need to add your data to this page by Sunday at 5pm. View the page after you add the data to it to see that you formatted it correctly.
  2. You also need to keep those saved search-results pages. We might be using them again.

I will take all of your data, consolidate it, and report back to the class what the results are.
