Searching in All the Right Places

The Obvious and Familiar
- To find tax information, ask the tax office

Libraries Online
- Many college and public libraries let you access their online catalogs and other information resources
- Libraries provide online facilities that are well organized and trustworthy
- Remember that many pre-1985 documents are not yet available online
- Plus, librarians are real live experts
How Is Information Organized?
- Hierarchical classification (like a family tree)
- Information is grouped into a small number of categories, each of which is easily described (the top-level classification)
- Information in each category is divided into subcategories (second-level classifications), and so on
- Eventually the classifications become small enough that you can look through the whole category to find the information you need
- This is a process of elimination as much as a process of choosing appropriate subcategories
Important Properties of Classifications
- Descriptive terms must cover all the information in the category and be easy for a searcher to apply
- Subcategories do not all have to use the same classifications
- The information in a category defines how best to classify it
- There is no single way to classify information (a small sketch of a hierarchy follows below)
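As a minimal sketch (not part of the original slides), a hierarchical classification can be modeled in Python as nested dictionaries whose innermost values are the items themselves; all of the category names and items below are invented for illustration.

```python
# A tiny hierarchical classification, modeled as nested dictionaries.
# Inner dicts are subcategories; lists hold the "leaves" (the items themselves).
# All category names and items here are made up for illustration.
music_site = {
    "Music": {
        "Browse Artists": {
            "A": ["Arcade Fire"],
            "C": ["Cave Singers", "Calexico"],
        },
        "Concerts": ["Tiny Desk"],
    },
    "News": {
        "Politics": ["Election coverage"],
        "Science": ["Space stories"],
    },
}

def drill_down(hierarchy, *categories):
    """Follow a chain of category choices from the root down to a subcategory."""
    node = hierarchy
    for category in categories:
        node = node[category]   # each choice eliminates the other branches
    return node

print(drill_down(music_site, "Music", "Browse Artists", "C"))
# ['Cave Singers', 'Calexico']
```

Each dictionary lookup narrows the search to one branch, which mirrors the process-of-elimination idea on the slide.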
How Is Web Site Information Organized?
- The homepage is the top-level classification for the whole Web site
- Classifications are the roots of hierarchies that organize large volumes of similar types of information
- Topic clusters are sets of related links, for example, sidebar and top-of-page navigation links
- Content information often fills the rest of a page
Alternate Hierarchy Presentations
- Top-level classifications can be expanded individually to show the next level of information
- Alternatively, a tabular form of the tree can be presented for a broader picture at a glance (sometimes called a site map)
- Our NPR homepage example offers both forms
Design of Hierarchies
General rules for the design and terminology of hierarchies:
- The root is usually drawn at the top (a branching-tree metaphor)
- "Going up" in the hierarchy means the classifications become more inclusive or general
- "Going down" in the hierarchy means the classifications become more specific or detailed
- The greater-than symbol (>) is a common way to show descent through levels of classification
Levels in a Hierarchy
- A one-level hierarchy has only one level of "branching" and no subdirectories
- To count levels, remember:
  - There is always a root
  - There are always "leaves", the categories themselves
  - The root and the leaves do not count as levels
- The NPR hierarchy, drawn as a tree, shows two classification levels between the root (the homepage) and the leaves (the content pages)
Other Hierarchy Considerations
- Groupings may overlap (one item can appear in more than one category) or be partitioned (every item appears in exactly one category)
- The number of levels may differ by category, even within the same hierarchical tree
- A single path from root to leaf is a full classification of the leaf content: Home > Music > Browse Artists > C > Cave Singers
- The "Tree of Life" biological taxonomy for humans is such a path
- (A sketch of recovering such a path appears below)
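A hedged follow-on sketch, reusing the invented nested-dictionary idea from the earlier example: the full classification of an item is the chain of category choices from the root down to its leaf, and the classification levels are simply the categories on that chain.

```python
# Invented hierarchy (redefined here so this snippet runs on its own).
music_site = {
    "Music": {"Browse Artists": {"C": ["Cave Singers", "Calexico"]}},
    "News": {"Science": ["Space stories"]},
}

def find_path(node, item, path=()):
    """Return the chain of categories from the root down to the item, or None."""
    if isinstance(node, dict):
        for category, child in node.items():
            found = find_path(child, item, path + (category,))
            if found is not None:
                return found
        return None
    return path if item in node else None   # node is a leaf list of items

path = find_path(music_site, "Cave Singers")
print(" > ".join(("Home",) + path))          # Home > Music > Browse Artists > C
print("classification levels:", len(path))   # 3 in this toy hierarchy
```

Note that the root ("Home") and the leaf item itself are not counted as levels, matching the counting rule on the previous slide.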
Searching the Web for Information
- Individual Web sites are carefully organized by their designers (hierarchically, for example)
- But no one organizes the entire Web, and it has grown unimaginably huge, far too huge to browse when looking for specific items
- Search engines solve the problem
- Popular search engines: Google, Yahoo!, MSN, AOL, Ask
Search Engine Basics
A search engine has two basic parts:
- Crawler: runs constantly, visiting sites on the Internet, discovering Web pages, and building or updating an index of the content it finds
- Query processor: looks up user-submitted keywords in the index and reports back a list of the Web pages the crawler has found containing those words
Crawlers
Crawlers build an index recording which Web pages (URLs) contain which words, based on the pages' HTML text. When a crawler visits a Web page it:
1. Adds every token (word) on the page to the index: words from the title, the body content, anchor text, and META tags
2. Associates the URL of the page with each of these words
3. Visits every page linked from the page being examined, and performs steps 1 through 3 on each

Crawlers can miss pages:
- If no other page points to the page
- If the page is created dynamically, on the fly
- If the page contains only images, or is of an unknown type (not HTML, PDF, etc.)

(A toy crawler and index sketch follows below.)
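As a hedged illustration (not the slides' own code), the sketch below crawls a tiny, invented in-memory "Web" instead of fetching real pages over HTTP, and builds the word-to-URL index the slide describes.

```python
from collections import deque

# A made-up, in-memory "Web": each URL maps to its text and its outgoing links.
# A real crawler would fetch pages over HTTP and parse their HTML.
PAGES = {
    "site.example/home":     ("red giant stars and other astronomy topics",
                              ["site.example/stars", "site.example/software"]),
    "site.example/stars":    ("a red giant is a dying star", []),
    "site.example/software": ("red giant video software tools", []),
    "site.example/hidden":   ("no page links here", []),   # unreachable, so never crawled
}

def crawl(start_url):
    """Visit pages reachable from start_url and build an inverted index:
    word -> set of URLs containing that word."""
    index = {}
    seen = set()
    frontier = deque([start_url])
    while frontier:
        url = frontier.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        for word in text.split():                    # step 1: tokenize the page
            index.setdefault(word, set()).add(url)   # step 2: associate the URL with each word
        frontier.extend(links)                       # step 3: follow the page's links
    return index

index = crawl("site.example/home")
print(sorted(index["red"]))   # the URLs of every crawled page containing "red"
```

The unlinked page never enters the frontier, which is exactly the first way the slide says a crawler can miss a page.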
Query Processors
- The user submits one or more keywords (the query)
- The index is consulted for these keywords, producing a list of the Web page URLs the crawler found
- It is important to give a good query in order to get a useful list of pages in reply: a query that is not specific returns a huge list of unwanted URLs
Multiword Index Searches
- A list of several keywords is an AND-query: red fish blue guppy
- This is the very common (default) case; it means each page found must contain ALL of the words
- The query processor looks up the URL list for each word in the index, then scans the lists for URLs common to all of them
- URL lists in indexes are often kept alphabetized to make this scan faster (a sketch follows below)
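A minimal sketch, not taken from the slides, of intersecting two alphabetized URL lists in a single pass, which is why keeping the lists sorted speeds up AND-queries; the URLs themselves are invented.

```python
def intersect_sorted(list_a, list_b):
    """Intersect two alphabetized URL lists with one linear scan."""
    i = j = 0
    common = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            common.append(list_a[i])   # URL appears in both lists
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1                     # advance whichever list is "behind"
        else:
            j += 1
    return common

# Invented URL lists for the words "red" and "fish":
red_urls  = ["a.example/1", "b.example/2", "d.example/9"]
fish_urls = ["b.example/2", "c.example/5", "d.example/9"]
print(intersect_sorted(red_urls, fish_urls))   # pages containing both words
```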
Advanced Searches
Search engines allow complex queries that return a smaller, more useful list of pages (see Google's Advanced Search page). Logical operators:
- AND: return only pages containing all of the terms, e.g. red AND fish AND blue AND guppy
- OR: return pages containing any of the given words, including pages where two or more of them appear, e.g. marshmallow OR strawberry OR chocolate
- NOT (or a leading minus sign): exclude pages containing the given word
- Combinations: tigers AND NOT baseball; (chocolate OR strawberry) AND sundae; Simpson bart OR lisa OR maggie -homer -marge
- Use parentheses to make your intent clear (a sketch of how these operators behave follows below)
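A hedged sketch of how the logical operators map onto set operations over the per-word URL sets in an index; the index contents below are invented.

```python
# Invented index: word -> set of URLs containing that word.
index = {
    "chocolate":  {"a.example", "b.example"},
    "strawberry": {"b.example", "c.example"},
    "sundae":     {"b.example", "d.example"},
}

def pages(word):
    return index.get(word, set())

# (chocolate OR strawberry) AND sundae: union first, then intersection
print((pages("chocolate") | pages("strawberry")) & pages("sundae"))   # {'b.example'}

# chocolate AND NOT strawberry: set difference excludes unwanted pages
print(pages("chocolate") - pages("strawberry"))                       # {'a.example'}
```

The parentheses in the query correspond directly to the grouping in the set expression, which is why they make the intent unambiguous.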
Effective Queries: Narrow the Hitlist
- Suppose you are writing a report on red giant stars. You issue the Google query red giant and get 4.9 million hits (URLs). Now what?
- The first few pages deal with software and rock bands, so try again with some restrictions: red giant -software -music
- Now we have 824,000 hits: better, and though the list is still big, the early URLs are the ones we need
Ordering the Hits
- How did Google decide the order of the 824,000 URLs and put the "best" ones at the top?
- The order is determined by relevance, judged in several ways:
  - Top URLs have "red giant" together on the page, in the order given in the query
  - Farther down the list are pages with "giant red", with the words separated, or with the words only in anchor text
- Google enhances this with a relevance score, called PageRank, computed for each page
PageRank
- Count the links into a page ("important" pages are pointed to by lots of other pages)
- Each page that links to a target page is counted as a "vote" for that target page
- If the "voting" page is itself highly ranked, its vote raises the target page's PageRank more
- Words in anchor text also raise PageRank
- The crawler computes this as it indexes
- The complete details of the PageRank algorithm are Google proprietary information (a sketch of the basic published idea follows below)
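The full algorithm is proprietary, as the slide notes, but the basic published formulation (links as votes, weighted by the rank of the voting page, with a damping factor) can be sketched; the link graph below is invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic published PageRank idea: each page splits its rank evenly among the
    pages it links to, so highly ranked pages cast heavier 'votes'."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                       # dangling page: spread its rank everywhere
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Invented link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))   # C ranks highest: it receives the most incoming votes
```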
Further Constraining Search
- Some Web sites offer search limited to the pages of that site only
- We can often focus a search by limiting it to URLs in specific domains (like .gov or .edu) or on specific sites (like www.youtube.com)
- This lets Google's PageRank do a good job of ordering the hits returned from the restricted domains
Web Information: Truth or Fiction?
- Anyone can publish anything on the Web; note the prevalence of blogs and wikis
- Some of what gets published is false, misleading, deceptive, self-serving, slanderous, or disgusting
- "If it is on the Web it must be true." NOT!
- How do we know whether the pages we find in our search are reliable?
Do Not Assume Too Much
- Registered domain names may be misleading or deliberate hoaxes: www.whitehouse.gov vs. www.whitehouse.org vs. www.whitehouse.com
- Look for the person or organization that publishes the Web page; respected organizations publish the best information
- A two-step check of the site's publisher:
  1. InterNIC (www.internic.net/whois.html) provides the name of the company that assigned the site's IP address, and a link to the WhoIs server maintained by that company
  2. Go to that WhoIs server site and type in the domain name or IP address again; the information returned is the owner's name and physical address
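Not part of the slides' two-step web lookup: a hedged command-line alternative, assuming a Unix-style whois tool is installed on the machine; the domain queried is just an example.

```python
import subprocess

def whois_record(domain):
    """Ask the local 'whois' command-line tool (if installed) for a domain's
    registration record, as an alternative to the web-based InterNIC lookup."""
    result = subprocess.run(["whois", domain], capture_output=True, text=True)
    return result.stdout

# Print just the lines that typically identify the registrant or organization.
for line in whois_record("whitehouse.gov").splitlines():
    if "Registr" in line or "Organization" in line:
        print(line.strip())
```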
Characteristics of Legitimate Sites
Web sites are most believable if they have these features:
- Physical existence: the site provides a street address, phone number, and e-mail address
- Expertise: the site includes references, citations or credentials, and related links
- Clarity: the site is well organized, easy to use, and has site-searching facilities
- Currency: the site was recently updated
- Professionalism: the site's grammar, spelling, and punctuation are correct, and all of its links work
Check and Double Check
- Remember that a site can have many of the features of legitimacy and still not be authoritative
- Example: http://www.dhmo.org/ (a hoax about the dangers of "dihydrogen monoxide", which is just H2O)
- Use known authoritative sites to cross-check, or consult respected debunkers (like snopes.com)
- When in doubt, check it out; ask a librarian
- Test your assessment skills: check out the Burmese Mountain Dog Web page
Summary
- Libraries are excellent primary resource tools, and large libraries have extensive online resources
- Libraries not only provide information digitally, they also connect us with "pre-digital" archives: the millions of books, journals, and manuscripts that still exist only in paper form
- We need software and our own intelligence to search the Internet effectively
Summary (continued)
- We create search queries using the logical operators AND, OR, and NOT, plus specific terms, to pinpoint the information we seek
- Once we have found information, we must judge whether it is correct by investigating the organization that publishes the page, including checking the credentials of the people who write the content
- We must cross-check the information with other sources, especially when the information is important