
Hidden Universes of Information on the Internet

Russ Haynal. Internet Instructor, Speaker, and Paradigm Shaker. 21015 Forest Highlands Ct Ashburn, VA 20147. Phone : 703-729-1757 russ@navigators.com. http://navigators.com. Hidden Universes of Information on the Internet. Rev. 04/2013.


Presentation Transcript


  1. Russ Haynal. Internet Instructor, Speaker, and Paradigm Shaker. 21015 Forest Highlands Ct, Ashburn, VA 20147. Phone: 703-729-1757. russ@navigators.com. http://navigators.com. Hidden Universes of Information on the Internet. Rev. 04/2013. Note: If you send me an email, put “internet training” in the email's subject line.

  2. Course Outline specific_page.html • Introduction to Internet Architecture • Preparing for a search • “Persona” Issues • Search Tools - In Depth • Advanced Features • Specialized Resources • Source Evaluation • Demo Sites and Summary Online Web page = http://navigators.com/opensource.html

  3. Disclaimer • This session illustrates a wide variety of search tools, techniques and research methods • Consult your organization’s policies to verify if these methods are approved for your type of Internet connection

  4. Internet Definition • “A large collection of interconnected networks and computers” • “A new fundamental form of communication that will absorb other communication channels” • The Internet represents a once-in-a-thousand-years event; the last such event was the Gutenberg printing press • Are You Literate in Today’s Online World?

  5. Internet’s Growth stats.html

  6. Number of hosts in each Domain stats.html Top Level Domains Source: www.isc.org as of July 2011

  7. Example Backbone Maps isp.html Sprint Level 3 AT&T C&W Verizon

  8. Backbones Connecting traceroute.html • For a complete picture, initiate traceroutes from within several different backbones [Diagram labels: Client (PC), Enterprise LAN/WAN, Regional ISP, Backbone ISP, Exchange Point, Private Peering, Backbone ISP-A, Backbone ISP-B, regional ISP #1, regional ISP #2, Large organization, Web hosting center, Server, Destination]

  9. Exchange Point Traffic isp.html • Notice the daily fluctuations - Analysts may want to “schedule” their research • Traffic continues to grow rapidly in many locations Source: http://www.hkix.net

  10. How Does it Work? traceroute.html • The Internet started as “Packet Switching Networks” using TCP/IP (Transmission Control Protocol / Internet Protocol) • Every Internet connection has a unique IP address consisting of 4 numbers, each in the range 0-255 (e.g., 198.211.16.134) • Internet IP numbers are allocated through a hierarchy: IANA --> ARIN/RIPE/APNIC/LACNIC/AFRINIC --> ISPs/Company/Country • Routers direct your packets of information along the “preferred” path • Note: The next generation of IP address space (IPv6) is quite LARGE: 3,911,873,538,269,506,102 IP #’s per square meter of the Earth's surface; 4,500,000,000,000,000 IP #’s for every observable star in the known universe [Diagram labels: Router (x8)]
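
The "4 numbers, each 0-255" rule from slide 10 can be checked with Python's standard `ipaddress` module. This is only a minimal sketch: the helper name `describe_ipv4` is made up for illustration, and the two totals printed at the end are the raw sizes of the IPv4 and IPv6 address spaces, not the per-square-meter estimate quoted on the slide.

```python
import ipaddress

def describe_ipv4(text):
    """Split a dotted-quad IPv4 address into its four 0-255 numbers.

    Hypothetical helper for illustration; raises ValueError if any
    number falls outside 0-255 or the text is not a valid address.
    """
    addr = ipaddress.IPv4Address(text)
    octets = list(addr.packed)   # four bytes, each in the range 0-255
    return octets, int(addr)     # the same address as one 32-bit integer

octets, as_int = describe_ipv4("198.211.16.134")
print(octets)     # the four numbers from the slide's example
print(as_int)     # the address expressed as a single integer
print(2 ** 32)    # total number of possible IPv4 addresses
print(2 ** 128)   # total number of possible IPv6 addresses
```

Passing a value such as "198.211.16.300" raises an error, which is a quick way to spot a mistyped address before running a traceroute or whois query on it.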

  11. Domain Name System domain_name.html • The Domain Name System (DNS) associates alphanumeric names with IP addresses • Names are registered with commercial registrars such as Go Daddy or country-specific registrars • DNS servers are distributed throughout the Internet - They act as a set of inter-linked phone books • You enter “www.navigators.com”, and the DNS servers match it to “198.171.173.51” • Historical meaning for domain names: • .com = commercial, .net = Internet provider, .org = non-profit • .uk = United Kingdom, .pk = Pakistan, .ru = Russia • Reality… Many country domain names are now for sale to ANYONE from ANYWHERE
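
As a rough sketch of slide 11, the snippet below reads the top-level domain off a host name and looks up the historical meaning listed on the slide. The function names are invented for this example. `resolve` simply wraps the standard `socket.gethostbyname` call and needs network access, so it is defined but not called here; the address it returns today may differ from the one shown on the slide.

```python
import socket

# Historical meanings as listed on the slide; many are no longer enforced.
HISTORICAL_TLD_MEANING = {
    "com": "commercial",
    "net": "Internet provider",
    "org": "non-profit",
    "uk": "United Kingdom",
    "pk": "Pakistan",
    "ru": "Russia",
}

def top_level_domain(hostname):
    """Return the last label of a host name, lower-cased."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()

def historical_meaning(hostname):
    """Look up the slide's historical meaning for the host's top-level domain."""
    return HISTORICAL_TLD_MEANING.get(top_level_domain(hostname), "unknown")

def resolve(hostname):
    """Ask the system's DNS resolver for an IPv4 address (requires network access)."""
    return socket.gethostbyname(hostname)

print(historical_meaning("www.navigators.com"))
```

Because so many country domains are sold to anyone, the lookup is only a starting hint; it says nothing reliable about where a site is actually operated.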

  12. Web Server / Web Site • The web site is the content • The web server is a computer loaded with server software and a reliable Internet connection • Web pages = name.html • Graphics = name.gif, name.jpg

  13. A more complex environment • Internet users interact with the web server • The web server's query is passed along to the database • The content of the database is only displayed TEMPORARILY in a web page that is created in response to USER actions • Most database content is unreachable by search engines [Diagram labels: Web Browser, typed form, page, Web server, Application server, data, Data Base, Online Hosting]

  14. Accessing a Web Page • A series of communications is involved to view a web page: 1. Browser requests URL: http://www.company.com/sales/gadget.html 2. Connect via Internet to web server company.com (background communications: cookies, browser info, etc.) 3. Server sends gadget.html from its sales directory 4. Browser displays gadget.html, requests graphics, and eventually terminates connection to the server • “Document not found”? - Try shortening the URL! [Diagram labels: company.com, Sales, gadget.html, logo.gif]
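
The "try shortening the URL" tip from slide 14 can be sketched with the standard `urllib.parse` module. The helper name `shortened_urls` is an assumption made for this example; it only builds the candidate URLs and does not fetch anything.

```python
from urllib.parse import urlsplit, urlunsplit

def shortened_urls(url):
    """Return the URL's parent directories, from the longest down to the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    results = []
    while segments:
        segments.pop()                      # drop the last file or directory name
        path = "/" + "/".join(segments)
        if segments:
            path += "/"                     # keep a trailing slash on directories
        results.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return results

for candidate in shortened_urls("http://www.company.com/sales/gadget.html"):
    print(candidate)
```

For the slide's example this yields the sales directory first and then the site's home page, which is the order in which one would normally try them by hand.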

  15. Course Outline specific_page.html • Introduction to Internet Architecture • Preparing for a search • “Persona” issues • Search Tools - In Depth • Advanced Features • Specialized Resources • Source Evaluation • Demo Sites and Summary

  16. Introduction to “Persona” persona.html • As you surf the Internet, you give off a certain persona • While viewing a web page (URL1), you click on a hyperlink to visit another web page (URL2) • Your web browser sends “environment variables” to the web server • Webmasters use this information to determine information about you and your organization (physical location, your interests, software, etc.) • You should always know what websites know about you [Diagram labels: Analyst, Internet Access, URL1, URL2, Web Server, Access logs, Reports, Webmaster]

  17. Persona Details persona.html • Your persona is communicated to every web server that you visit • You should be explicitly aware of your persona before you visit any website. For example, should you visit badguy.com from agency.gov? • Your persona is communicated via “environment variables” such as: • REMOTE_HOST = This is the name associated with your IP number. • REMOTE_ADDR = This is the IP number of your computer, or proxy. A webmaster could do a traceroute to see how you are connected. • HTTP_REFERER = This is the URL of the page you were previously viewing. Be careful how you create web pages. For example, do you want to reveal that http://badguy.com is listed on http://intranet.agency.gov/joe_smith/investigation_targets.html? • Your persona may also be transmitted via JavaScript files such as ga.js and urchin.js (Google Analytics)
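
To make the three variables on slide 17 concrete, here is a small sketch of what a web server could record about a visitor. It is written as a WSGI application, which is one common Python server interface and not necessarily what any given site runs. The sample request values are invented for illustration (192.0.2.10 is a reserved documentation address), and the server is only simulated by calling the function directly.

```python
def persona_report(environ):
    """Collect the persona-related variables a web server receives."""
    keys = ("REMOTE_HOST", "REMOTE_ADDR", "HTTP_REFERER")
    return {key: environ.get(key, "(not sent)") for key in keys}

def persona_app(environ, start_response):
    """Minimal WSGI application that echoes the visitor's persona as plain text."""
    report = persona_report(environ)
    body = "\n".join(f"{k} = {v}" for k, v in report.items()).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]

# Simulated request; all values below are made up for the example.
sample = {
    "REMOTE_HOST": "proxy.agency.gov",
    "REMOTE_ADDR": "192.0.2.10",
    "HTTP_REFERER": "http://intranet.agency.gov/joe_smith/investigation_targets.html",
}
sent = []
body = persona_app(sample, lambda status, headers: sent.append(status))
print(body[0].decode("utf-8"))
```

A webmaster would typically write these values to an access log on every request rather than echoing them back to the visitor.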

  18. A Typical Scenario... persona.html • Analyst (persona: agency.gov OR town.ninja.com) enters “search terms” at searchtool.com and hits the page http://searchtool.com/query=searchterms • The searchtool.com webmaster knows your “search terms” • The analyst then visits destination.com • The destination.com webmaster knows what “search terms” you used to find them

  19. Always check your Persona persona.html • Important Note: This testing page is most accurate when you click on a link to arrive at this page. • You can also search for: proxify who am I • [Screenshot callout] This is a key paragraph to look for… If this is missing, then no referring URL is being passed

  20. Think before you click... persona.html • Does your connection method leak a Referring URL? • IF IT DOES... do NOT “Click” on your search results • Hover over the link to see its URL • Example Referring URL: http://www.google.com/query=terrorist_&start=110 • A click on this search result will tell the webmaster at orgnet.com that you are searching for “terrorist”
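
The leak described on slide 20 takes very little effort to exploit on the receiving end. The sketch below shows how a destination site could pull search terms out of a referring URL. The URL shape and the parameter name `q` are assumptions chosen for illustration, since real search tools encode their queries in different ways.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms_from_referrer(referrer, param="q"):
    """Return the search terms a destination webmaster could read from a referring URL.

    `param` is the query-string parameter assumed to hold the search terms.
    """
    query = parse_qs(urlsplit(referrer).query)
    return " ".join(query.get(param, []))

# Hypothetical referring URL, as it might appear in a web server's access log.
print(search_terms_from_referrer(
    "http://searchtool.com/search?q=wireless+antenna&start=110"))
```

If the referring URL is not passed at all, the function has nothing to parse, which is the situation the persona test page on the previous slide helps confirm.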

  21. Exposing a “less recognizable” persona • Analyst #1 uses the agency.gov persona to visit “targets” • Analyst #2 uses the “ninja.com” persona to visit “targets” • Now the “ninja” persona may be recognized as an “agency.gov” visitor • The “parallel visit” problem: Analyst #1 (agency.gov) and Analyst #2 (ninja.com) both visit target.com. Even with no HTTP_REFERER, a webmaster can still make the association due to high-volume hits or similar usage patterns. • The “portal” problem: both analysts reach target.com from Agency_portal.com/page_names, so target.com sees persona = agency.gov + referrer = portal, and persona = ninja.com + referrer = portal

  22. Plan out your Internet Research search_methodology.html • Spell it Out - Define the topic, spell it out, key words, acronyms, “what” and “who” • Strategize - Choose your approach, which online resources, search tools • Search - Get online, execute, stay focused, use advanced search features • Sift - Filter the results, follow the leads • Save – Make bookmarks, take notes, organize results, share with co-workers.

  23. Spell out the topic... search_methodology.html 1. Name of topic, and what do you want to learn about the topic: ____________ 2. Spell out the topic (words, acronyms, abbreviations) in two columns: generic, simple terms ____________ and obscure, specific terms ____________ 3. Make a list of “who” might publish such information (industry association, government agency, NGOs, user group, etc.): ____________

  24. Follow All Good Leads in Parallel multiple_browsers.html • Many users only follow one good lead at a time: from the Results page (linkA, linkB, linkC) they follow linkA to Page A and keep going, so valuable links B & C never get explored • Right-click to open each link in its own browser window (or tab) • Switch between open windows using “ALT-TAB” or “ALT-TAB-TAB-TAB” • Note: HTTP_REFERER is still transmitted • Do NOT launch multiple browsers from desktop or start-menu. [Diagrams: a single chain Results --> Page A --> Page 1 --> Page Y, compared with Results opening Page A, Page B, and Page C in parallel]

  25. Course Outline specific_page.html • Introduction to Internet Architecture • Preparing for a search • “Persona” issues • Search Tools - In Depth • Advanced Features • Specialized Resources • Source Evaluation • Demo Sites and Summary

  26. Overview of Internet Search Tools search_tool_intro.html • Search Engines (Google, search.yahoo.com) • Large database – text from billions of clickable pages • Directories (dir.yahoo.com, www.dmoz.org) • Manually built subject trees–links to millions of web sites • “User Pages” (Joe’s guide to widgets) • Built by subject experts- hundreds of topic-related links Pick the right tool... Each Tool has strengths and weaknesses

  27. Internet Directory (e.g., dir.yahoo.com, www.dmoz.org) search_tool_intro.html • Links are grouped by topic • Pages are manually built • Good for early stages of search, general subjects • Filer may not be a subject expert [Diagram labels: Directory; URLs & Descriptions (submitted by Users)]

  28. Typical Format of a Directory search_tool_intro.html • Subject category: use to navigate • Alias to another branch • Number of listings in this sub-category • Discover “user pages” on this subject

  29. Searching a directory... search_tool_intro.html (example search term: wireless) • Searches the text within the directory’s own web pages. • Use search terms that would appear in: • category titles • web site titles • web site’s brief description • You are NOT searching the websites – just their brief description

  30. Search Engines (e.g., Google.com, bing.com) search_tool_intro.html • The search engine’s “robot” explores the Internet and copies web pages into its database • Supports very detailed keyword searches • Take the time to learn about the features & options of the search engine • You must envision what the target page will look like. “Use your imagination” • Try adding the words “resume” or “curriculum vitae” to your search terms [Diagram labels: Your PC, Search Interface, Search Engine Site, Robot, Indexer, Indexed Database, cached Web pages, copied Web page, Web Servers]

  31. Search Engine Comparison search_tools.html • http://ranking.thumbshots.com – Compares the first sixty hits from two search engines you select • Notice on this search for “jihad”, only 12 out of 60 hits appeared in both Google and Yahoo… Most hits are unique to each search engine. • News, forums and analysis of search engines.

  32. Pay Attention to search results cached.html • Clustering – Google shows a maximum of 2 hits per domain (indentation = clustering) • See hits from only that domain • Cached = Google’s local text copy of the page. Graphics will still be downloaded from the remote website, unless you get to a “text only” version of the cached page • Problem: Getting a “text only” version of Google’s cached page has become a complicated process. For a detailed set of options see: navigators.com/cached.html

  33. “User Pages” search_tool_intro.html • Usually focused on a specific subject • Developed by “experts” in that field (or just a person with passion for the subject) • Often contain “the best” online resources [Diagram labels: Expert, Info, Potential weblink]

  34. Finding “User pages” search_universes.html • Announced to Dmoz and other directories • Linked at wikipedia, wikimapia • Groups of users at forums, blogs and mailing lists • Watch for sites labeled: “Joe’s ultimate guide to widgets” • “User pages” often point to other “user pages” • “Surfing Upstream” from several related sites (covered in Hidden Universes part 2) • Ask other researchers – there are several sites that everyone knows as “the best” • Interactive, live communication (Chat, telephony, virtual worlds)

  35. Wiki • A Wiki allows immediate creation and editing of pages by “anyone” • Wikipedia.org – Encyclopedia that can be instantly edited by ANY Internet user. • Good starting point for many subjects to gain an overview of the topic • Page can be biased from the most recent editor • Some entries get “locked-down” due to vandalism • old.wikimapia.org – same concept applied to Google Maps • “map type” Google map: zoom to the right location • “map type” --> “wikimapia classic”: to see comments • To learn about the authors: Click on the comment box --> menu --> history --> the user’s name --> stats --> then clicking on the stat numbers listed shows every place that user has added

  36. old.wikimapia.org • “map type” --> “wikimapia classic” • Anybody can add labels & comments onto the map

  37. Mailing List search techniques search_universes.html • Many mailing lists are mentioned on web pages • Use Google or Bing to search: • Your_searchterms “mailing list” – will find some mailing lists that talk about your search terms • Your_searchterms “mailing list” reply thread index – will find postings from the mailing list that have been archived onto the web • Websites that host mailing lists tend to have great content

  38. Blogs and Forums • A Web Log (blog) is usually owned by one person. • Owner can post a log of their daily activities, or post ongoing comments about a topic. • Others may also be allowed to add comments onto the blog • WordPress and Blogger are popular sites • Forum – An online discussion site focused on a particular topic • Many users can participate by posting messages. • Moderators may “police” comments that are considered off-topic • Try searching for: • Searchterms forum post – to find a forum that discusses your topic • Searchterms forum post replies views – to find individual threads and messages that discuss your topic • Membership requirements are a barrier to search engine robots • vBulletin is a popular program used on many websites
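
The query patterns from the mailing-list and forum slides can be generated for any topic. The function below is just a convenience sketch with an invented name; it builds the four query strings and leaves it to the researcher to paste them into a search engine.

```python
def discovery_queries(terms):
    """Build the mailing-list and forum query patterns for the given search terms."""
    return {
        "mailing lists": f'{terms} "mailing list"',
        "mailing list postings": f'{terms} "mailing list" reply thread index',
        "forums": f"{terms} forum post",
        "forum threads": f"{terms} forum post replies views",
    }

for purpose, query in discovery_queries("wireless").items():
    print(f"{purpose}: {query}")
```

The extra words work because they commonly appear in the page furniture of list archives and forum software, not because the search engine treats them specially.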

  39. Course Outline specific_page.html • Introduction to Internet Architecture • Preparing for a search • “Persona” issues • Search Tools - In Depth • Advanced Features • Specialized Resources • Source Evaluation • Demo Sites and Summary

  40. “The web” is TINY search_universes.html • Many detailed searches are a two-step process: 1. Initial search: first find the specialized database 2. Detailed search: then type a very specific query into the database [Diagram: Total Online Material, with labels Search engines, World Wide Web (pages.html), Email, Mailing Lists, Chat, VOIP, Blogs, Forums, Specialized Databases, Multimedia, (1000X larger than the web), Closed systems]

  41. Lists of Search Engines search_tool_specialized.html • For specific information, use a specialized search tool • Get “deeper” results than a general search engine • Thousands of search engines are listed • Search engines are grouped according to the subject they cover 70,000 databases .com .net 55,000 public record databases

  42. Specialized Search Engines search_tool_specialized.html • A phone book for the entire U.S. Includes reverse look-ups • List of manufacturers and companies from around the world • Real-time tracking of ships from around the world • Federal Register and much more • Specialized search tools contain content that search engines can’t reach

  43. Country-Specific Resources universes_part2.html • The online training page links to many additional resources such as news, language translators, and country-specific resources. (Hidden Universes part 2) 20,000 news sources Phone books

  44. Course Outline specific_page.html • Introduction to Internet Architecture • Preparing for a search • “Persona” issues • Search Tools - In Depth • Advanced Features • Specialized Resources • Source Evaluation • Demo Sites and Summary

  45. Source evaluation sesseval.html • Pick apart the URL: • Determine where “ownership” of the web page begins • www.facebook.com/joesmith/info.html • www.joesmith.com/stuff/info.html • Browse the directories (shorten URL if necessary) • Look at domain’s home page - Is it a web hosting site? Is “pathname” a user account? • Search for “links to” the page, shows who links to that specific web page. (command = link: ) • IF the domain home page looks like the “owner” of the content, then you can move forward with whois and traceroute
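
Slide 45's first step, deciding where "ownership" of a page begins, can be sketched as follows. The set of hosting sites is a one-entry placeholder for illustration; in practice the analyst decides this by looking at the domain's home page, as the slide describes. The function name is also an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical, illustrative list of sites that host pages for many different users.
HOSTING_SITES = {"www.facebook.com"}

def likely_owner(url):
    """Guess which part of a URL identifies the content's owner.

    On a known hosting site the first path segment is treated as a user
    account; otherwise the domain itself is treated as the owner.
    """
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]
    if host in HOSTING_SITES and segments:
        return f"{host}/{segments[0]}"
    return host

print(likely_owner("www.facebook.com/joesmith/info.html"))
print(likely_owner("www.joesmith.com/stuff/info.html"))
```

Only in the second case does it make sense to move on to whois and traceroute for the domain, since in the first case those tools would describe the hosting company, not the author.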

  46. Source Evaluation - Using WHOIS whois.html • Domain names are “registered” at Internet registrars (global, country-specific) • Each registrar develops its own policies • may authenticate requester of domain name (.gov, .mil) • may accept anyone (with money) (.com, .org, .net, .tv, .pk, etc) • Registrants provide “point of contact” information, for at least invoicing purposes • Domain “point of contact” information is often available from the registrars’ database via a “WHOIS” query • WHOIS contents may be inaccurate, although usually the email, or postal address will be correct to receive renewal invoice

  47. Performing a “Whois Query” whois.html • “whois” reveals the “owner” of a domain (searchenginewatch.com) • Administrative contact: Ron Doobay, HAYMARKET HOUSE, 28-29 HAYMARKET, LONDON SW1Y 4RX, UK; +44.2074849700; +44.2079302238; dns@incisivemedia.com • Technical contact: Domain Administrator, 3rd Floor Prospero House, 241 Borough High Street, Borough, London SE1 1GA, UK; +44.2070159370; +44.2070159375; corporate-services@netnames.com • Created on: 1998-03-20 • Expires on: 2015-03-19 • Domain name servers: NS3.INCBASE.NET 85.133.68.200; NS2.INCBASE.NET 62.140.213.136; NS1.INCBASE.NET 62.140.213.135 • Spam concerns have led to many domain names being registered via “privacy enhanced” options
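
A WHOIS query is a very simple exchange: the client opens a TCP connection to port 43, sends the query followed by a line break, and reads text until the server closes the connection. The sketch below shows that exchange plus a small parser for the "Created on" and "Expires on" lines seen on slide 47. Both function names are invented; the caller must supply the registrar's WHOIS server, and field labels differ between registrars, so the parser's labels are an assumption based on this one record. `whois_query` needs network access and is not called here.

```python
import socket

def whois_query(domain, server, port=43, timeout=10):
    """Send one WHOIS query to `server` and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:            # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def extract_fields(whois_text, labels=("Created on", "Expires on")):
    """Pick out 'label: value' lines; label names vary by registrar."""
    found = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in labels:
            found[key.strip()] = value.strip()
    return found

# Two lines taken from the record shown on the slide.
sample = "Created on: 1998-03-20\nExpires on: 2015-03-19\n"
print(extract_fields(sample))
```

Sending many such queries from one IP number is exactly the pattern that can get a proxy blocked, so any automated use should be slow and sparing.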

  48. Whois Complications whois.html • Integrated database of: .com, .net, .org, .biz, .coop, .museum, .edu, .info, .aero, .arpa, .int • Intense competition between registrars makes WHOIS more difficult • Some registrars block whois queries originating from competing registrars, or from popular third-party whois sites • Queries may also get blocked if too many queries originate from the same IP # (i.e., one proxy for many researchers) [Diagram: WHOIS queries between Registrars and Third-Party Whois Sites; legend: permitted WHOIS query, query may be blocked]

  49. Traceroute traceroute.html • Shows a network path between 2 machines • Traceroute was designed to help debug network connections • Can initiate traceroute from your workstation, or from public “traceroute servers” located throughout the Internet • Each Internet provider has its own naming convention for its infrastructure • Location labels: City names or 3-letter airport codes • Exchange points (LINX, HKIX, AMS-IX) • Infrastructure Topology (T3, FDDI, GE) • Recognize that a website can be hosted anywhere • Could be at the organization’s own site, but may be hosted at a well-connected web hosting facility

  50. Results of Traceroute traceroute.html
      traceroute Output from WWW.Telcom.Arizona.EDU to www.nsa.gov:
      traceroute to www.nsa.gov (65.213.217.241), 30 hops max, 40 byte packets
       1 128.196.128.253 (128.196.128.253) 1 ms 1 ms 1 ms
       2 192.80.43.25 (192.80.43.25) 1 ms 1 ms 1 ms
       3 192.80.43.58 (192.80.43.58) 1 ms 1 ms 1 ms
       4 207.250.65.133 (207.250.65.133) 5 ms 5 ms 5 ms
       5 core-01-ge-2-2-0.phnx.twtelecom.net (209.234.146.45) 5 ms 5 ms 5 ms
       6 core-02-so-0-0-0-0.lsag.twtelecom.net (168.215.53.73) 17 ms 17 ms 17 ms
       7 tran-01-ge-0-3-0-0.lsag.twtelecom.net (168.215.54.98) 17 ms 17 ms 17 ms
       8 500.POS1-1.GW3.LAX1.ALTER.NET (208.222.8.245) 17 ms 17 ms 17 ms
       9 122.at-0-0-0.CL2.LAX4.ALTER.NET (152.63.52.246) 18 ms 17 ms 18 ms
      10 0.so-0-0-0.TL2.LAX9.ALTER.NET (152.63.115.146) 20 ms 20 ms 0.so-7-0-0.TL2.LAX2.ALTER.NET (152.63.2.82) 18 ms
      11 0.so-6-0-0.TL2.DCA8.ALTER.NET (152.63.3.193) 74 ms 75 ms 74 ms
      12 0.so-5-0-0.XL2.DCA8.ALTER.NET (152.63.35.250) 74 ms 74 ms 74 ms
      13 188.ATM6-0.GW3.BWI1.ALTER.NET (152.63.39.41) 75 ms 75 ms 76 ms
      14 * * *
      15 * * *
      • Hops 7-8: indicates that Time-Warner and UUNET (Alternet) peer at Los Angeles
      • Hop 13: BWI = airport code
      • Traceroute and other online resources help reveal the dynamic architecture of the Internet
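
The reading of slide 50's output (spotting where one provider hands off to another) can be approximated in code. This is a rough sketch with invented function names: it assumes the classic Unix traceroute line format shown on the slide and treats the last two labels of each router name as the provider's network, which will not hold for every naming convention. The sample is a few lines copied from the slide's output.

```python
import re

# Hop number, router name, then the IP address in parentheses.
HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d+\.\d+\.\d+\.\d+)\)")

def parse_hops(output):
    """Return (hop number, router name, IP) for each hop line that answered."""
    hops = []
    for line in output.splitlines():
        match = HOP.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops

def network_of(hostname):
    """Last two labels of a router name, lower-cased; None for bare IP addresses."""
    labels = hostname.lower().split(".")
    if labels[-1].isdigit():
        return None
    return ".".join(labels[-2:])

def handoffs(hops):
    """Pairs of consecutive named hops whose network names differ."""
    named = [(n, h, network_of(h)) for n, h, _ in hops if network_of(h)]
    return [(a, b) for a, b in zip(named, named[1:]) if a[2] != b[2]]

sample = """\
 5 core-01-ge-2-2-0.phnx.twtelecom.net (209.234.146.45) 5 ms 5 ms 5 ms
 7 tran-01-ge-0-3-0-0.lsag.twtelecom.net (168.215.54.98) 17 ms 17 ms 17 ms
 8 500.POS1-1.GW3.LAX1.ALTER.NET (208.222.8.245) 17 ms 17 ms 17 ms
14 * * *
"""
for before, after in handoffs(parse_hops(sample)):
    print(before[0], before[2], "->", after[0], after[2])
```

On the sample this reports the handoff between hops 7 and 8, matching the slide's note about the two providers peering; the location still has to be read from the router names by a person.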
