Searching in All the Right Places

Presentation Transcript


  1. Searching in All the Right Places
     - The Obvious and Familiar
       - To find tax information, ask the tax office
     - Libraries Online
       - Many college and public libraries let you access their online catalogs and other information resources
       - Libraries provide online facilities that are well organized and trustworthy
       - Remember that many pre-1985 documents are not yet available online
       - Plus, librarians are real live experts

  2. How Is Information Organized?
     - Hierarchical classification (like a family tree)
       - Information is grouped into a small number of categories, each of which is easily described (top-level classification)
       - Information in each category is divided into subcategories (second-level classifications), and so on
       - Eventually the classifications become small enough for you to look through the whole category to find the information you need
     - This is a process of elimination as much as of choosing appropriate subcategories

  3. Important Properties of Classifications
     - Descriptive terms must cover all the information in the category and be easy for a searcher to apply
     - Subcategories do not all have to use the same classifications
     - Information in the category defines how best to classify it
     - There is no single way to classify information

  4. How Is Web Site Information Organized?
     - The homepage is the top-level classification for the whole Web site
     - Classifications are the roots of hierarchies that organize large volumes of similar types of information
     - Topic clusters are sets of related links
       - For example, sidebar and top-of-page navigation links
     - Content information often fills the rest of a page

  5. Alternate Hierarchy Presentations
     - Top-level classifications can be expanded individually to show next-level information
     - Alternatively, a tabular form of the tree can be presented for a broader picture at a glance (sometimes called a site map)
     - Our NPR homepage example offers both forms

  6. Design of Hierarchies
     - General rules for the design and terminology of hierarchies:
       - The root is usually at the top (branching metaphor)
       - "Going up in the hierarchy" means the classifications become more inclusive or general
       - "Going down in the hierarchy" means the classifications become more specific or detailed
       - The greater-than (>) symbol is a common way to show going down through levels of classification

  7. Levels in a Hierarchy
     - A one-level hierarchy has only one level of "branching": no subdirectories
     - To count levels, remember:
       - There is always a root
       - There are always "leaves": the categories themselves
       - The root and leaves do not count as levels
     - The NPR hierarchy, drawn as a tree, shows 2 classification levels between the root (homepage) and the leaves (content pages)

  8. Other Hierarchy Considerations
     - Groupings may overlap (one item can appear in more than one category) or be partitioned (every item appears in exactly one category)
     - The number of levels may differ by category, even in the same hierarchical tree
     - A single path from root to leaf is a full classification of the leaf content
         Home > Music > Browse Artists > C > Cave Singers
     - The "Tree of Life" biological taxonomy for humans is such a path
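
The idea that one root-to-leaf path fully classifies a leaf can be sketched in a few lines of Python. This is only an illustration: the nested dictionary is a made-up stand-in for a site hierarchy, the "News" branch is invented, and `find_path` is a hypothetical helper name. Only the Music path comes from the slide.

```python
from typing import Optional

# Hypothetical site hierarchy: each key is a category, each value its subcategories.
# Only the "Music" branch mirrors the slide; the "News" branch is invented.
SITE = {
    "Home": {
        "News": {"World": {}, "Politics": {}},
        "Music": {
            "Browse Artists": {
                "C": {"Cave Singers": {}},
            },
        },
    },
}


def find_path(tree: dict, leaf: str, trail: tuple = ()) -> Optional[list]:
    """Depth-first search that returns the categories from the root down to `leaf`."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == leaf:
            return list(here)
        found = find_path(children, leaf, here)
        if found is not None:
            return found
    return None  # leaf is not under this part of the tree


path = find_path(SITE, "Cave Singers")
print(" > ".join(path))  # Home > Music > Browse Artists > C > Cave Singers
```

Note that the branches here have different depths, which matches the point that the number of levels may differ by category.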

  9. Searching the Web for Information
     - Individual Web sites are carefully organized by their designers (hierarchically, for example)
     - But no one organizes the entire Web, and it has grown unimaginably HUGE: too huge to just browse looking for specific items
     - Search engines solve the problem
     - Popular search engines: Google, Yahoo!, MSN, AOL, Ask

  10. Search Engine Basics
      - A search engine has two basic parts:
        - Crawler: runs constantly, visiting sites on the Internet, discovering Web pages, and building and updating an index to the content it finds
        - Query processor: looks up user-submitted keywords in the index and reports back a list of Web pages the crawler has found containing those words

  11. Crawlers
      - Build an index telling which Web pages (URLs) contain which words, based on their HTML text
      - When a crawler visits a Web page, it:
        1. Adds all tokens (words) on the page to the index (words from the title, the body content, anchor text, META tags)
        2. Associates the URL of the page with each of these words
        3. Visits all pages linked from the page being examined, and repeats steps 1 through 3 on each
      - Crawlers can miss a page if:
        - No other page points to it
        - It is created dynamically, on the fly
        - It contains only images, or is of an unknown type (not HTML, PDF, etc.)
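
The crawl loop can be sketched as follows. This is a toy under stated assumptions, not how any real search engine is built: the "web" is a hypothetical in-memory dictionary (`PAGES`) instead of real HTTP fetches, the URLs are invented, and tokenizing is reduced to pulling out lowercase words.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("red fish blue fish", ["a.example/guppy"]),
    "a.example/guppy": ("blue guppy care", ["a.example/"]),
    "a.example/orphan": ("red guppy secrets", []),  # no page links here
}


def crawl(seed: str) -> dict:
    """Breadth-first crawl from `seed`; returns an index of word -> set of URLs."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)  # associate this URL with each word on the page
        for link in links:  # then visit every page this one links to
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return dict(index)


index = crawl("a.example/")
print(sorted(index["blue"]))   # ['a.example/', 'a.example/guppy']
print("secrets" in index)      # False
```

The orphan page never enters the index because no crawled page points to it, which is the first of the ways a crawler can miss a page.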

  12. Query Processors
      - The user submits one or more keywords (the query)
      - The index is consulted for these keywords, producing a list of Web page URLs found by the crawler
      - It is important to give a good query to get a useful list of pages in reply
      - Query not specific → huge list of unwanted URLs

  13. Multiword Index Searches
      - A list of several keywords is an AND-query
          red fish blue guppy
      - This is the very common (default) use; it means each found page must contain ALL the words
      - Look in the index for the URL list for each word, then scan the lists for URLs common to all
      - URL lists in indexes are often alphabetized to make this faster
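
A minimal sketch of why alphabetized URL lists help: two sorted lists can be intersected in a single pass, much like merging. The index contents and the function names `intersect` and `and_query` are assumptions made up for this example.

```python
def intersect(a: list, b: list) -> list:
    """Single-pass intersection of two alphabetized URL lists."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1
    return common


def and_query(index: dict, words: list) -> list:
    """Return the URLs that contain ALL of the given words."""
    lists = [index.get(word, []) for word in words]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest list to keep results small
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# Hypothetical index: word -> alphabetized list of URLs containing it.
INDEX = {
    "red": ["a.example", "c.example", "d.example"],
    "fish": ["a.example", "b.example", "d.example"],
    "blue": ["a.example", "d.example", "e.example"],
    "guppy": ["d.example", "e.example"],
}

print(and_query(INDEX, ["red", "fish", "blue", "guppy"]))  # ['d.example']
```

Because both lists are in order, each comparison lets the scan discard one entry for good, so no list is ever rescanned.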

  14. Advanced Searches
      - Search engines allow complex queries to get a smaller, more useful list of returned pages (see Google's Advanced Search page)
      - Logical operators
        - AND: tells the search engine to return only pages containing all terms
            red AND fish AND blue AND guppy
        - OR: finds pages containing any word given, including pages where 2 or more appear
            marshmallow OR strawberry OR chocolate
        - NOT/-: excludes pages containing the given word
      - Combinations:
          tigers AND NOT baseball
          (chocolate OR strawberry) AND sundae
          Simpson bart OR lisa OR maggie -homer -marge
      - Use parentheses to make your intent clear
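
The logical operators map directly onto set operations, which makes the effect of parentheses easy to see. The sketch below assumes a tiny hypothetical index of page IDs (`p1` to `p6`). It illustrates the logic only and is not the query syntax of any particular search engine.

```python
# Hypothetical index: word -> set of page IDs containing that word.
INDEX = {
    "chocolate": {"p1", "p2"},
    "strawberry": {"p2", "p3"},
    "sundae": {"p2", "p3", "p4"},
    "tigers": {"p5", "p6"},
    "baseball": {"p6"},
}


def pages(word: str) -> set:
    """Pages containing `word` (empty set if the word is not indexed)."""
    return INDEX.get(word, set())


# (chocolate OR strawberry) AND sundae
grouped = (pages("chocolate") | pages("strawberry")) & pages("sundae")

# chocolate OR (strawberry AND sundae): same words, different grouping
regrouped = pages("chocolate") | (pages("strawberry") & pages("sundae"))

# tigers AND NOT baseball
no_baseball = pages("tigers") - pages("baseball")

print(sorted(grouped))      # ['p2', 'p3']
print(sorted(regrouped))    # ['p1', 'p2', 'p3']
print(sorted(no_baseball))  # ['p5']
```

The two groupings return different pages, which is why parentheses matter whenever AND and OR are mixed.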

  15. Effective Queries: Narrow the Hitlist
      - Suppose you are writing a report on red giant stars. You issue the Google query
          red giant
        and get 4.9 million hits (URLs). Now what?
      - The first few pages deal with software and rock bands, so try again with some restrictions
          red giant -software -music
      - Now we have 824,000 hits: better, still big, but the early URLs are somehow the ones we need

  16. Ordering the Hits
      - How did Google decide on the order for the 824,000 URLs and put the "best" ones at the top?
      - Order is determined by relevance, in several ways:
        - Top URLs have "red giant" together on the page, in the order given in the query
        - Further down the list are pages with "giant red", or with the words separated, or with the words in anchor text
      - Google enhances this with a relevance score called PageRank for each page
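
The ordering described above can be mimicked with a toy scoring function. The three score levels, the page texts, and the name `relevance` are all assumptions invented for this sketch. Real search engines combine far more signals, and this is not Google's method.

```python
def relevance(text: str, query: str) -> int:
    """Toy score: 3 = exact phrase, 2 = phrase reversed, 1 = all words present, 0 = otherwise."""
    words = text.lower().split()
    terms = query.lower().split()
    n = len(terms)
    # Every run of n consecutive words on the page.
    windows = [words[i:i + n] for i in range(len(words) - n + 1)]
    if terms in windows:
        return 3
    if terms[::-1] in windows:
        return 2
    if all(term in words for term in terms):
        return 1
    return 0


# Hypothetical pages: ID -> page text.
PAGES = {
    "p1": "a giant star that glows red",
    "p2": "a red giant is a dying star",
    "p3": "the giant red spot on jupiter",
}

ranked = sorted(PAGES, key=lambda page: relevance(PAGES[page], "red giant"), reverse=True)
print(ranked)  # ['p2', 'p3', 'p1']
```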

  17. PageRank
      - Count the links into a page ("important" pages are pointed to by lots of other pages)
      - Each page that links to a target page is considered a "vote" for that target page
      - If the "voting page" is itself highly ranked, this raises the PageRank of the target page
      - Words in anchor text raise the PageRank
      - The crawler computes this as it indexes
      - Complete details of the PageRank algorithm are Google proprietary information
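
The "voting" idea can be sketched with the simplified, publicly described form of PageRank: repeatedly pass each page's score along its outgoing links. This is only an illustration. The damping factor of 0.85, the fixed 50 iterations, and the four-page link graph are assumptions, and since the full algorithm is proprietary, this is not Google's actual implementation.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Simplified PageRank by repeated vote passing; `links` maps page -> pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its vote over every page.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share  # a vote weighted by the voter's own rank
        rank = new_rank
    return rank


# Hypothetical link graph: A, B, and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ranks highest because three pages vote for it, and A outranks B and D because its single vote comes from the highly ranked C.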

  18. Further Constraining Search
      - Some Web sites offer a search limited to the pages of that site
      - We can often focus a search by limiting it to URLs in specific domains (like .gov or .edu) or to specific sites (like www.youtube.com)
      - This lets Google's PageRank do a good job of ordering the hits we get from the restricted domains
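
Search engines typically expose this restriction through their own query syntax (Google, for example, accepts a `site:` operator). The sketch below only shows the underlying idea of filtering a hit list by host name. The URLs are invented and `in_domain` is a hypothetical helper.

```python
from urllib.parse import urlparse


def in_domain(url: str, suffix: str) -> bool:
    """True if the URL's host is `suffix` itself or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    suffix = suffix.lower().lstrip(".")
    return host == suffix or host.endswith("." + suffix)


# Hypothetical hit list returned by a query.
hits = [
    "https://www.nasa.gov/red-giants",
    "http://astro.example.edu/stars",
    "https://redgiant.example.com/software",
    "https://notgov.example/gov",
]

gov_only = [url for url in hits if in_domain(url, ".gov")]
print(gov_only)  # ['https://www.nasa.gov/red-giants']
```

Matching on the parsed host name, instead of searching the whole URL for ".gov", avoids false matches on paths or look-alike names.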

  19. Web Information: Truth or Fiction?
      - Anyone can publish anything on the Web
        - Note the prevalence of blogs and wikis
      - Some of what gets published is false, misleading, deceptive, self-serving, slanderous, or disgusting
      - "If it is on the Web, it must be true." NOT!
      - How do we know if the pages we find in our search are reliable?

  20. Do Not Assume Too Much
      - Registered domain names may be misleading or deliberate hoaxes
          www.whitehouse.gov vs. www.whitehouse.org vs. www.whitehouse.com
      - Look for the person or organization that publishes the Web page
      - Respected organizations publish the best information
      - A two-step check for the site's publisher:
        1. InterNIC (www.internic.net/whois.html) provides the name of the registrar through which the site's domain name was registered, and a link to the WhoIs server maintained by that company
        2. Go to the WhoIs server site and type the domain name or IP address again. The information returned is the owner's name and physical address

  21. Characteristics of Legitimate Sites
      - Web sites are most believable if they have these features:
        - Physical existence: the site provides a street address, phone number, and e-mail address
        - Expertise: the site includes references, citations or credentials, and related links
        - Clarity: the site is well organized, easy to use, and has site-searching facilities
        - Currency: the site was recently updated
        - Professionalism: the site's grammar, spelling, and punctuation are correct; all links work

  22. Check and Double Check
      - Remember that a site can have many of the features of legitimacy and still not be authoritative
        - Example: http://www.dhmo.org/ (a hoax about the dangers of dihydrogen monoxide, H2O)
      - Use known authoritative sites to cross-check, or consult respected debunkers (like snopes.com)
      - When in doubt, check it out. Ask a librarian.
      - Test your assessment skills: check out the Burmese Mountain Dog web page

  23. Summary
      - Libraries are excellent primary resource tools
      - Large libraries have extensive online resources
      - Libraries not only provide information digitally, they also connect us with "pre-digital" archives: the millions of books, journals, and manuscripts that still exist only in paper form
      - We need software and our own intelligence to search the Internet effectively

  24. Summary
      - We create search queries using the logical operators AND, OR, and NOT, together with specific terms, to pinpoint the information we seek
      - Once we've found information, we must judge whether it is correct by investigating the organization that publishes the page, including checking the credentials of the people who write the content
      - We must cross-check the information with other sources, especially when the information is important
