
Search In Small World Networks


Presentation Transcript


  1. Search In Small World Networks
  Presented by George Frederick

  2. Searching the World Wide Web
  Steve Lawrence and C. Lee Giles

  3. Overview
  • Research performed in 1998, so results are dated
  • The Web is constantly growing and changing, making size estimation difficult
  • Search engine companies claimed they could keep up
  • The authors tested actual search engine coverage

  4. Overview
  • Typical coverage tests are performed by checking the number of results returned
  • This is unreliable: ranking algorithms may match related terms rather than the exact search terms
  • Documents may no longer exist, giving engines with stale data an advantage
  • Documents may have been altered since indexing, changing their relevance

  5. Market Share
  • Selberg and Etzioni attempted to calculate the “market share” of search engines using their MetaCrawler aggregate search service
  • Market share was calculated as the percentage of documents users followed through from each search engine

  6. Market Share
  Market share according to Selberg and Etzioni in 1997

  7. Market Share
  • Drawbacks to this calculation method:
  • It is difficult for users to determine relevance without clicking through and examining pages first
  • User relevance judgments are biased by presentation order

  8. Web Coverage
  • Selberg and Etzioni also attempted to calculate the Web coverage of each search engine
  • This was flawed because MetaCrawler retrieves only the first few results from each engine
  • An engine may return unique documents for the first few results while the rest are the same as other engines'
  • Or it may return the same documents for the first few results while the rest are unique

  9. Web Coverage
  • Lawrence and Giles made their own attempt at calculating search engine Web coverage
  • They analyzed AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light
  • Google was not publicly available at the time of the study
  • It was commonly believed that the engines indexed roughly the same material and that each covered most of the Web

  10. Data Gathering
  • Collected the search engines' responses to queries issued by NEC Research Institute employees over several days
  • Retrieved every matching index entry and its corresponding document for each query in order to count them
  • All index entries were needed to avoid ranking bias
  • All documents were needed to verify that they still existed and had not been altered since indexing

  11. Data Gathering
  • Duplicate documents were not counted twice, even with different URLs
  • Only lowercase queries were considered
  • Pages not displayed within a minute were not counted
  • Only queries returning up to 600 results collectively were considered
  • Only exact search-term matches were counted
  • 575 queries were analyzed in total
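
One of these rules, counting a document only once even when it appears under several URLs, is easy to illustrate. The sketch below is a hypothetical reconstruction, not the authors' code: `content_key` and `unique_documents` are invented names, and hashing the page text is just one plausible way to detect duplicates.

```python
import hashlib

def content_key(page_text: str) -> str:
    """Key a page by a hash of its content rather than its URL, so the
    same document reached via different URLs counts only once."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def unique_documents(results: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop duplicate documents from (url, page_text) pairs, keeping
    the first URL seen for each distinct page."""
    seen: set[str] = set()
    unique = []
    for url, text in results:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```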

  12. Sources of Bias
  • The size of the Web was estimated from search engine coverage overlap
  • This is biased because not all pages can be indexed
  • The set of all pages that can be indexed by search engines is referred to as the “indexable Web”

  13. Sources of Bias
  • Pages are often manually registered with several different search engines, meaning indices are not collected randomly
  • Pages that are highly linked to by other pages are more likely to be indexed

  14. Methodology
  • The authors expected larger engines to be less dependent on manual registration
  • They don't rely as heavily on user-submitted pages
  • They can crawl and find less popular pages
  • The assumption is that the larger an engine is, the more accurate an estimate it can provide of the size of the Web

  15. Methodology
  • Analyzed the overlap between the two largest engines (AltaVista and HotBot)
  • Estimated that the “indexable Web” has a lower bound of 320 million pages
  • Common earlier estimates ranged from only 75–200 million
  • The authors assert that these were significant underestimations
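
The overlap analysis is, at heart, a capture-recapture estimate: if two engines sampled the Web independently, the size of their overlap would reveal the size of the whole. A minimal sketch with hypothetical counts (the paper's raw index sizes are not reproduced here); because manual registration correlates the indices and inflates the overlap, the resulting figure is a lower bound.

```python
def estimate_web_size(n_a: float, n_b: float, n_overlap: float) -> float:
    """Capture-recapture: if engine A indexes n_a pages, engine B
    indexes n_b, and n_overlap pages appear in both, independence
    implies a total of roughly n_a * n_b / n_overlap pages."""
    return n_a * n_b / n_overlap

# Hypothetical index sizes in millions of pages, chosen only to
# illustrate the arithmetic.
n_a, n_b, overlap = 100, 80, 25
web_size = estimate_web_size(n_a, n_b, overlap)
print(f"estimated indexable Web: {web_size:.0f} million pages")  # 320
print(f"engine A coverage: {100 * n_a / web_size:.0f}%")         # ~31%
```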

  16. Size Estimates
  Estimated indexable Web size

  17. Size Estimates
  Percentage of the indexable Web each search engine covers

  18. Additional Analysis
  Percentage of invalid links

  19. Additional Analysis
  • Median age of documents
  • Results suggest that the engines with the most recent pages don't necessarily have the best coverage
  • There is a tradeoff between database size and update frequency

  20. Results
  • Search engine coverage varies by an order of magnitude
  • The indexable Web is estimated to have a lower bound of 320 million pages
  • Engines index only a fraction of the Web
  • Individual engines each covered between 3% and 34% of the indexable Web

  21. Conclusion
  • Combining the results of multiple search engines significantly increases the number of unique results
  • Search aggregators would significantly increase the quality of search results

  22. The Small-World Phenomenon and Decentralized Search
  Jon Kleinberg

  23. Small World Search
  • Related to gossip algorithms in that each node works with only local knowledge, yet the system exhibits emergent behavior
  • The Watts–Strogatz model involves a d-dimensional lattice with uniformly random shortcuts
  • It is possible to prove that in this model no decentralized search can find short paths with only local knowledge

  24. Small World Search
  • A subtle variation involves shortcuts whose probability decays like the d-th power of their distance (in d dimensions); this variation supports efficient search
  • i.e., a node is approximately as likely to create shortcuts at distances 1 to 10 as it is at distances 10 to 100, 100 to 1000, etc.
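
A minimal simulation sketch of this construction in one dimension (d = 1), under assumed parameters: a ring of N nodes where each node knows its two lattice neighbours plus one shortcut whose length r is drawn with probability proportional to 1/r, searched greedily. All names and parameter values here are illustrative, not Kleinberg's.

```python
import random

N = 1000  # ring size (illustrative)

def ring_distance(a: int, b: int) -> int:
    d = abs(a - b)
    return min(d, N - d)

# Inverse-first-power shortcut-length distribution, computed once.
lengths = list(range(1, N // 2))
weights = [1.0 / r for r in lengths]

def shortcut(u: int) -> int:
    r = random.choices(lengths, weights=weights)[0]
    return (u + random.choice((-1, 1)) * r) % N

# Each node knows its two lattice neighbours and one shortcut endpoint.
neighbours = {u: [(u - 1) % N, (u + 1) % N, shortcut(u)] for u in range(N)}

def greedy_search(source: int, target: int) -> int:
    """Forward to whichever known neighbour is closest to the target
    (in lattice distance); the lattice links guarantee progress."""
    hops, node = 0, source
    while node != target:
        node = min(neighbours[node], key=lambda v: ring_distance(v, target))
        hops += 1
    return hops

trials = [greedy_search(random.randrange(N), random.randrange(N))
          for _ in range(200)]
print(f"mean greedy path length: {sum(trials) / len(trials):.1f} hops")
```

With this exponent the expected greedy path length grows only polylogarithmically in N; Kleinberg's theorem is that any other exponent makes decentralized search polynomially slow.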

  25. Small World Search
  A node with several random shortcuts spanning different distance scales

  26. Small World Search
  Networks constructed according to this variation of the Watts–Strogatz model have been successfully employed in peer-to-peer file-sharing systems and on the Internet

  27. Identity and Search in Social Networks
  Duncan J. Watts, Peter Sheridan Dodds, and M. E. J. Newman

  28. Overview
  Proposes a model that defines a class of searchable networks, together with a search method that is applicable to many network search problems

  29. Searchability
  • Searchability is defined as the property of being able to find a target quickly
  • Searchability has been shown to exist in scale-free and lattice networks, but neither is a satisfactory model of society

  30. Social Network Model
  • The authors assert that the proposed model is based on plausible social structures
  • It follows naturally from six contentions about social networks

  31. 1. Identities
  • Nodes in social networks have identities in addition to relationships
  • Identities are defined as the sets of characteristics that individuals attribute to themselves and others through their association in social groups
  • A group is defined as a collection of nodes with a well-defined set of social characteristics

  32. 2. Hierarchical View
  • Individuals break the world down into a hierarchy of more and more specific layers
  • The top layer is the world
  • The bottom layer is the individual
  • In practice, individuals don't usually go all the way down but stop at a cognitively manageable layer
  • A reasonable upper bound on group size is g = 100

  33. 2. Hierarchical View
  • The similarity x_ij between individuals i and j is the height of their lowest common ancestor in this hierarchy
  • x_ij = 1 if i and j are in the same group
  • Hierarchies are defined to have depth l and branching ratio b
  • The hierarchy is a purely cognitive construct for measuring social distance, not the actual network
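
A small sketch of this similarity measure, under the assumption (made here purely for illustration) that bottom-level groups are numbered as the leaves of a complete b-ary tree, so that integer division by b moves one level up the hierarchy:

```python
def similarity(group_i: int, group_j: int, b: int = 2) -> int:
    """x_ij: height of the lowest common ancestor of two groups in a
    b-ary hierarchy. x = 1 when i and j share a group, x = 2 when
    their groups share a parent, and so on."""
    x = 1
    while group_i != group_j:
        group_i //= b  # move one level up the hierarchy
        group_j //= b
        x += 1
    return x

print(similarity(5, 5))  # 1: same group
print(similarity(4, 5))  # 2: sibling groups (4 // 2 == 5 // 2)
```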

  34. 3. Homophily
  • The more similar individuals are, the more likely it is that they know each other
  • To construct the network, randomly choose a node i and a link distance x with probability p(x) = c·exp(−αx)
  • α is a tunable parameter (a measure of homophily)
  • c is a normalizing constant

  35. 3. Homophily
  • Choose a second node j uniformly from all nodes at distance x from i
  • Repeat until nodes have an average of z friends

  36. 3. Homophily
  • When e^(−α) ≪ 1, all links are as short as possible, meaning individuals have connections only to those most similar to themselves, forming many isolated cliques
  • When e^(−α) = b, individuals are equally likely to be linked to any other individual, resulting in a uniform random graph
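
A sketch of the link-construction step, reusing `similarity()` from the sketch above. `groups` maps each node to its bottom-level group, and `alpha`, `b`, and `l` are the model's parameters; the rejection loop is an illustrative way to handle distances at which no candidate exists, not necessarily the authors' procedure.

```python
import math
import random

def add_link(i: int, groups: list[int], alpha: float, b: int, l: int) -> int:
    """Pick a friend j for node i: draw a distance x with probability
    p(x) = c * exp(-alpha * x), then choose j uniformly among all
    nodes at exactly that distance from i (redrawing x if none exist)."""
    distances = list(range(1, l + 1))
    # random.choices normalizes the weights, so the constant c cancels.
    weights = [math.exp(-alpha * x) for x in distances]
    while True:
        x = random.choices(distances, weights=weights)[0]
        candidates = [j for j in range(len(groups))
                      if j != i and similarity(groups[i], groups[j], b) == x]
        if candidates:
            return random.choice(candidates)
```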

  37. 4. Multiple Hierarchies
  • Individuals split the world into multiple hierarchies depending on context, e.g. geography or occupation
  • These represent different societal dimensions
  • A node's identity is defined as an H-dimensional coordinate vector v_i, where v_i^h is the position of node i in the h-th hierarchy/dimension

  38. 4. Multiple Hierarchies
  • Each node i is randomly assigned a coordinate in each of the H hierarchies and allocated friends as previously described, choosing a hierarchy h at random for each link
  • When H = 1 and e^(−α) ≪ 1, the link density must obey the constraint z < g

  39. 5. Social Distance
  • Individuals have their own perception of “social distance”: y_ij = min_h x_ij^h
  • Close proximity in just one hierarchy is sufficient to connote affiliation
  • This violates the triangle inequality
  • Individuals i and j can be close in one hierarchy and individuals j and k close in another, but i and k may still be far apart in both hierarchies
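
A sketch of the perceived distance, again reusing `similarity()` from above; the example identities are made up to demonstrate the triangle-inequality violation just described:

```python
def social_distance(v_i: list[int], v_j: list[int], b: int = 2) -> int:
    """y_ij = min over hierarchies h of x_ij^h, where v_i and v_j are
    H-dimensional coordinate vectors (one group index per hierarchy)."""
    return min(similarity(gi, gj, b) for gi, gj in zip(v_i, v_j))

# i and j share hierarchy 0, j and k share hierarchy 1, yet i and k
# are far apart in both: the triangle inequality fails.
v_i, v_j, v_k = [0, 5], [0, 9], [7, 9]
print(social_distance(v_i, v_j))  # 1
print(social_distance(v_j, v_k))  # 1
print(social_distance(v_i, v_k))  # 4  > 1 + 1
```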

  40. 6. Local Knowledge
  • Individuals have only local information:
  • Their own coordinate vector v_i
  • Their neighbors' coordinate vectors v_j
  • The target's coordinate vector v_t
  • Both social distances and network paths are known
  • Neither alone is sufficient for efficient searching
  • Combined, they are

  41. 6. Local Knowledge
  Following Milgram's procedure, each node forwards the message to the one neighbor j it perceives to be closest to the target t
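
A single greedy forwarding step might look like the following sketch, which reuses `social_distance()` from above; in the full simulation this step is repeated until the message reaches the target, with the chain also terminating at each hop with probability p.

```python
def forward(current: int, target: int,
            neighbours: dict[int, list[int]],
            identities: dict[int, list[int]], b: int = 2) -> int:
    """One Milgram-style greedy step: hand the message to whichever
    neighbour is perceived to be socially closest to the target."""
    return min(neighbours[current],
               key=lambda j: social_distance(identities[j],
                                             identities[target], b))
```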

  42. Modeling Milgram
  • The principal objective is to determine whether the average length ⟨L⟩ of a message chain between a randomly selected sender s and target t is small
  • “Small” has previously been defined to mean that ⟨L⟩ grows slowly with the population size N
  • That definition is insufficient here, because chains terminate at each hop with probability p = 0.25
  • An absolute bound is required instead

  43. Modeling Milgram
  • A searchable network is defined as one in which the probability q of successful delivery is at least some fixed value r
  • In terms of chain length, the authors formally require q = ⟨(1 − p)^L⟩ ≥ r
  • This gives a maximum allowed chain length of ⟨L⟩ ≤ ln r / ln(1 − p)
  • For experimental purposes, the authors set r = 0.05 and p = 0.25, requiring ⟨L⟩ ≤ 10.4, independent of the population size N
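
The bound follows by taking logarithms: (1 − p)^⟨L⟩ ≥ r gives ⟨L⟩ ln(1 − p) ≥ ln r, and dividing by the negative quantity ln(1 − p) flips the inequality to ⟨L⟩ ≤ ln r / ln(1 − p). Worked numerically:

```python
import math

r, p = 0.05, 0.25  # required success rate, per-hop attrition
print(math.log(r) / math.log(1 - p))  # ~10.4 hops
```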

  44. Conclusion
  • By setting the parameters in accordance with the six sociological contentions, the authors are able to reproduce Milgram's results in simulation
  • The concept is general enough to be applied to many types of networks beyond social ones, such as peer-to-peer systems, the Web, and citation networks

  45. Conclusion
  The multi-dimensional aspect of the model makes decentralized database organization and searching more efficient with simple, greedy algorithms

  46. Papers
  • S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98–100, 1998. http://citeseer.ist.psu.edu/lawrence98searching.html
  • J. Kleinberg. The Small-World Phenomenon and Decentralized Search. SIAM News, 37(3), April 2004.
  • D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and Search in Social Networks. Science, 296(5571):1302–1305, 2002.
