Class #2: Google is not enough!
This class is an overview of search - what is there to use other than Google?
People look for information all the time - for class, for personal reasons, and in job searches.
Try Cha Cha! Text 242 242 and ask any question you want, and it will respond with the answer.
There has been a big trend in venture capital-backed mobile search programs.
Professor Moore loves to search for random, sometimes pointless pieces of information, like the height of college hockey coaches.
IBM says that by 2015, the amount of data will double every 11 hours - wild fact. Technology (hardware and software) is also advancing at an alarming rate.
What is knowledge? Knowledge used to be just knowing about something. These days, data is increasing so rapidly that knowledge now means "the ability to learn and the ability to FIND as well". Hence, the importance of searching.
Big point of the semester: you will need many different tools to meet your diverse and changing data searching needs - we constantly need new searching tools!
Back in Professor Moore's day, you needed to talk to a librarian or someone else who knew what they were doing to find information. Nowadays, you can just type a random statement into Google and get what you need.
Changes over time:
- Then —> Now
- Experts —> You & Me
- Well-defined queries —> Ill-defined
- Thousands of documents —> Billions of documents
Search engine capabilities
What exactly does a search engine or search tool do?
- generates structured query results
- provides a way of exploring that structure (ex: Google gives you 10 pages at first, and you can click through to see more than 10)
- helps you monitor changes in search results (ex: Google can submit queries every day and tell you the changes in search results)
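To make the monitoring idea concrete, here is a minimal sketch of comparing two result lists for the same query. The function name `diff_results` and the example URLs are made up for illustration; a real monitoring service may well work differently.

```python
def diff_results(previous, current):
    """Compare two ranked result lists and report what changed."""
    prev_set, curr_set = set(previous), set(current)
    added = [url for url in current if url not in prev_set]     # new since last run
    dropped = [url for url in previous if url not in curr_set]  # no longer returned
    return added, dropped

# Hypothetical results for the same query on two different days.
yesterday = ["example.com/a", "example.com/b", "example.com/c"]
today = ["example.com/a", "example.com/c", "example.com/d"]

added, dropped = diff_results(yesterday, today)
print("added:", added)
print("dropped:", dropped)
```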
Categorizing search engines
- query terms (combined with either "and" or "or"; most search engines default to "and")
- search targets (HTML, addresses, images, blogs, PDF, books, video, maps, etc.)
- indexed information (information searched is the actual text in the document vs. information searched is the meta-information about the page)
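The difference between "and" and "or" query terms can be sketched with a toy example. The three documents and the `search` helper below are invented for illustration and are not how any particular search engine is implemented.

```python
# Toy document collection; the texts are made up for illustration.
docs = {
    1: "college hockey coaches",
    2: "hockey scores and standings",
    3: "college admissions guide",
}

def matches(query_terms, text, mode="and"):
    """Return True if the text satisfies the query under AND or OR semantics."""
    words = set(text.lower().split())
    hits = [term.lower() in words for term in query_terms]
    return all(hits) if mode == "and" else any(hits)

def search(query, mode="and"):
    """Return the ids of all documents that match the query."""
    terms = query.split()
    return [doc_id for doc_id, text in docs.items() if matches(terms, text, mode)]

and_results = search("college hockey", mode="and")
or_results = search("college hockey", mode="or")
print(and_results)  # [1]
print(or_results)   # [1, 2, 3]
```

With "and", every term must appear, so only document 1 comes back. With "or", any one term is enough, so the result list is longer and usually less precise.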
Special search terms
Different search engines have special search terms and operators to search more specifically:
- + means that the word HAS to be in the document
- - means that the word CANNOT be in the document
- * is a wildcard character
- "..." (quotation marks around a phrase) gives you the EXACT quote in the exact order
- intitle: gives you the documents that have the exact word in the TITLE of the document. Makes search results more specific.
- inurl: returns pages where the searched term is in the actual URL
- site: searches a specific website
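As a rough illustration of how these operators combine into a single query string, here is a small sketch. The helper `build_query` and the site `example.edu` are hypothetical, and which operators are actually supported varies from one search engine to another.

```python
def build_query(required=(), excluded=(), phrase=None, title_word=None, site=None):
    """Assemble a query string from the operators listed above."""
    parts = []
    parts += [f"+{word}" for word in required]   # word HAS to be in the document
    parts += [f"-{word}" for word in excluded]   # word CANNOT be in the document
    if phrase:
        parts.append(f'"{phrase}"')              # exact phrase, exact order
    if title_word:
        parts.append(f"intitle:{title_word}")    # word must be in the title
    if site:
        parts.append(f"site:{site}")             # restrict to one website
    return " ".join(parts)

query = build_query(
    required=["hockey"],
    excluded=["nhl"],
    phrase="head coach",
    site="example.edu",
)
print(query)  # +hockey -nhl "head coach" site:example.edu
```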
Evaluating a search engine's performance
How do you evaluate a search engine's performance?
- Hope for overlap between the "relevant" search results (what you were looking for) and the "retrieved" search results (what you actually get back). Leaves you with 3 sub-sets: not retrieved but relevant (A), retrieved and relevant (B), and retrieved but not relevant (C).
- Recall = B / (A+B) —> want it to be as high as possible
- Precision = B / (B+C) —> want it to be as high as possible
- Recall and precision used to be equally important. Nowadays, precision is much more important.
- Recall is impossible to calculate in practice, since you would need to know which of the unretrieved documents are relevant, and you never get to see those.
- Precision can be calculated, because you are looking through all the retrieved results.
- Google only cares about the top 3 search results. They don't care about the precision of the top 1000 documents, just the precision of the top 3. The first assignment will study the precision of different search engines in the top 20 search results. How many are relevant?
- Important point: relevance is VERY subjective and totally user-defined in the very moment he or she is searching.
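The recall and precision formulas above can be worked through on a small made-up example. This sketch assumes we somehow know the full set of relevant documents, which (as noted above) is not the case for a real web search; the document ids and function names are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall using the A/B/C subsets from the notes."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(relevant - retrieved)   # A: relevant but not retrieved
    b = len(retrieved & relevant)   # B: retrieved and relevant
    c = len(retrieved - relevant)   # C: retrieved but not relevant
    precision = b / (b + c) if retrieved else 0.0
    recall = b / (a + b) if relevant else 0.0
    return precision, recall

def precision_at_k(ranked_results, relevant, k):
    """Precision over only the top k results, e.g. k=3 or k=20."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

# Hypothetical ranked results and hypothetical relevance judgments.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

precision, recall = precision_recall(ranked, relevant)
p_at_3 = precision_at_k(ranked, relevant, 3)
print(precision)  # 0.4  (B=2, C=3)
print(recall)     # 0.5  (A=2, B=2)
print(p_at_3)     # 2 of the top 3 are relevant
```

Since relevance is user-defined, the `relevant` set here stands in for one person's judgment at one moment; a different searcher could get different numbers from the same results.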
How does the process of searching work?
- Person goes to search engine, enters a query
- Search engine goes to the searchable information (documents, meta-information, etc.)
- Search engine creates results for the user to look at
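The three steps above can be sketched in a few lines. This toy version assumes the "searchable information" is an inverted index (a map from each word to the documents containing it), which is a common technique but not something the lecture specified; the documents and function names are made up.

```python
from collections import defaultdict

# Hypothetical documents standing in for the searchable information.
documents = {
    "doc1": "Google is not enough",
    "doc2": "search tools beyond Google",
    "doc3": "mobile search programs",
}

def build_index(docs):
    """Map each lowercase word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def run_query(index, query):
    """Return the documents containing every query term (default "and")."""
    terms = query.lower().split()
    if not terms:
        return []
    matching = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matching)

index = build_index(documents)             # searchable information
results = run_query(index, "google search")  # person enters a query
print(results)                             # results for the user to look at
```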
Differences among search engines
How do search engines differ from one another?
- different queries supported
- different types of automation
- content (pages, categories, paid links, etc.)
- format of results
- delivery form
- subset of web (target, quality of coverage, where the information comes from)
- searchable information (how frequently is it updated?)
- search engine itself (quality of experience, quality of responsiveness)
Googling isn't always the right thing to do when searching for information! Students in this course will learn how to efficiently and effectively find the right information using the right tool.