
  1. Finding Information on the Web • Srinivasan Seshadri, CTO, Kosmix

  2. Early Internet (1992 – 1994) • Mozilla Browser • People linked to others' home pages and other interesting pages • People really browsed

  3. INTERNET (1995 – 2002) • Search: AltaVista, Lycos • Google • Used the hyperlink graph structure to rank results
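
Ranking by hyperlink graph structure can be illustrated with a toy PageRank-style power iteration. This is only a sketch: the three-page graph, the damping value of 0.85, and the function name `pagerank` are assumptions for illustration, not details from the talk or a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: pages "a" and "b" both link to "c"; "c" links back to "a".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
example_rank = pagerank(example_links)
```

In this graph, "c" collects links from two pages and ends up with the highest score, while "b", which nobody links to, ends up with the lowest.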

  4. Internet Now • Kosmix is bringing back the joys of browsing and exploring • 360-degree view of any topic • Topic Home page (why not a topic ) • The results are the top informational sites for a topic, each with a preview (snippets)!

  5. INFORMATION TYPES • Factual Information (Wiki etc.) • Videos • Images • Forum Discussions • Questions and Answers • News • Blogs • Structured Information

  6. FUTURE OF SEARCH • First step towards providing multiple pivot points for a topic or search • Need to make this conversational and stateful, like talking to an expert on the topic

  7. Transient Intent and Persistent Intent • TRANSIENT INTENT • Searching for a needle in the haystack • Exploring the haystack for a topic • PERSISTENT INTENT • Interested in the topic for a long time • Examples: Carnatic Music, Indian Cricket, Internet Industry, Venture Capital

  8. INFORMATION • Deliver to consumers the information they want: what they want, when they want, how they want, where they want

  9. PERSONALIZED NEWSPAPER • My world is changing • I cannot keep track of it • Can my world come to me?

  10. MEDIA INDUSTRY AND INTERNET • Huge pressure on newspapers • Ad spending moving online • More and more content online • Reputed journalists have their own blogs • Content production, aggregation, and distribution are becoming disaggregated • A vanilla online newspaper does not exploit what the internet enables • Ability to personalize to nano interests • Publish a personalized newspaper for everyone, any time

  11. Key Technology Ingredients • Cloud Computing • Categorization • Relevance

  12. Cloud Computing at Kosmix • Storage: • Biggest productivity boost at Kosmix in the first year: getting machines to be remotely rebooted! • KFS (Kosmix File System) further lowered the time to make data accessible after machine failures • Computation: • Long-running computations need to be broken into small restartable/replayable components
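
One way to picture "small restartable components" is a loop that records its progress after every step, so that a rerun after a failure resumes instead of starting over. The sketch below is an assumption-laden illustration, not Kosmix's system: the helper name `run_restartable`, the JSON checkpoint file, and the per-item granularity are all hypothetical.

```python
import json
import os


def run_restartable(items, process, checkpoint_path):
    """Process items in order, saving progress after each small step."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        # A previous run got part of the way: resume from where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)
    for i in range(state["next_index"], len(items)):
        state["results"].append(process(items[i]))
        state["next_index"] = i + 1
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]
```

If `process` raises partway through, calling `run_restartable` again with the same `checkpoint_path` skips the items that already completed. Results must be JSON-serializable for this particular sketch to work.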

  13. Cloud Computing at Kosmix • Computation Templates: • Most of the computation could be expressed as some variant of a single table scan plus an aggregate operation (group by), which Google calls MapReduce • MapReduce is not friendly enough for non-programmers • SQL is not powerful enough in many situations • Need a nice scripting language
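
The "table scan plus group by" view of MapReduce can be sketched in a few lines of single-machine Python. This is a minimal illustration under stated assumptions: the function `map_reduce`, the `pages` rows, and the per-host word-count task are hypothetical examples, and a real system would distribute the scan and the grouping across many machines.

```python
from collections import defaultdict


def map_reduce(rows, map_fn, reduce_fn):
    """Scan rows once, group the emitted values by key, then aggregate each group."""
    groups = defaultdict(list)
    for row in rows:                      # the single table scan
        for key, value in map_fn(row):    # map: emit (key, value) pairs
            groups[key].append(value)     # group by key
    # reduce: one aggregate per group
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical table of crawled pages.
pages = [
    {"host": "a.example", "words": 120},
    {"host": "b.example", "words": 50},
    {"host": "a.example", "words": 80},
]


def emit_host_words(row):
    yield row["host"], row["words"]


def total(key, values):
    return sum(values)


totals = map_reduce(pages, emit_host_words, total)
# totals == {"a.example": 200, "b.example": 50}
```

The same job in SQL would be `SELECT host, SUM(words) ... GROUP BY host`; the map/reduce form becomes useful when the per-row or per-group logic is too irregular to express that way.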

  14. Opportunity? • Many companies are trying to provide interesting web services • The web is a gold mine of information that companies can use • It is impractical for each company to build a huge web-scale support system (crawling, indexing, KFS, MapReduce, etc.) • Furthermore, most companies want slivers of the web (typically category-based slivers: health forums, travel news sites, etc.) • The web and all the information derived from it is perhaps the biggest database. Can someone make it accessible and easy to use (with some pay-as-you-go model)? Or is there perhaps a non-profit (academia?) angle here?

  15. Categorization • Concept Space: the space in which all connections are made within Kosmix • Documents, Queries, External Modules, Advertisements, and People are all mapped to points in this space and matched • Documents about the Internet Industry or Venture Capital need to be mapped to these categories even if they don't contain the original words

  16. Categorization at Kosmix • Leverage human-curated sources • Wiki corpus is a major source of knowledge • Huge automatically curated taxonomy • 6 million concepts • Building a Concept Graph with relationship labels where possible • Use a web index to match short pieces of text with concepts, and use the taxonomy to refine the matches

  17. Relevance • Need to combine multiple signals into one number to enable ranking • Say Query Relevance Score and Page Relevance Score (text score and PageRank) • Signals need to be made comparable • Normalization alone (making ranges the same) is not enough • Need to reconcile different distributions • Compare deviations from the mean
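
The "deviations from the mean" idea can be sketched with z-scores: each signal is expressed as a number of standard deviations from its own mean before the signals are mixed. This is one common reading of the slide, not necessarily the method Kosmix used; the function names and the equal weights below are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: the signal carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine(text_scores, link_scores, w_text=0.5, w_link=0.5):
    """Mix two signals into one number per document after standardizing each."""
    z_text = z_scores(text_scores)
    z_link = z_scores(link_scores)
    return [w_text * t + w_link * l for t, l in zip(z_text, z_link)]


# Hypothetical signals on very different scales for three documents.
text_scores = [0.2, 0.5, 0.8]
link_scores = [1000.0, 10.0, 1.0]
combined = combine(text_scores, link_scores)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Unlike rescaling both signals to the same range, standardizing also accounts for how spread out each signal is, which is the sense in which plain normalization "is not enough".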

  18. Relevance • More data always beats smarter algorithms • Adding position information to the index greatly increases quality • Adding stemming produced a CTR rise of 10% • Adding anchors (and PageRank) distinguished Google • Adding the origin of anchors (hosts) is a much better measure of independent votes • Using demand-side popularity (Alexa, Quantcast) complements web popularity
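
The point about anchor origins can be made concrete by counting distinct linking hosts rather than raw links, so that many links from one site count as a single vote. The sketch below is an illustration only: the function name `independent_votes` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def independent_votes(inlinks):
    """Map each target URL to the number of distinct hosts that link to it.

    inlinks is an iterable of (source_url, target_url) pairs.
    """
    hosts_by_target = {}
    for source_url, target_url in inlinks:
        host = urlsplit(source_url).hostname
        hosts_by_target.setdefault(target_url, set()).add(host)
    return {target: len(hosts) for target, hosts in hosts_by_target.items()}


# Hypothetical link data: three links point at t.example, but two of them
# come from the same host, so they count as one independent vote.
example_inlinks = [
    ("http://a.example/page1", "http://t.example/"),
    ("http://a.example/page2", "http://t.example/"),
    ("http://b.example/post", "http://t.example/"),
    ("http://a.example/page3", "http://u.example/"),
]
votes = independent_votes(example_inlinks)
# votes == {"http://t.example/": 2, "http://u.example/": 1}
```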

  19. RELEVANCE • What is a news story? • Cluster news articles • Use the size of a cluster as a measure of popularity • How does one do this efficiently? • Needs to be online since interests/queries are ad hoc • Need to combine some offline preclustering with online methods
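
The online half of this can be sketched as incremental "leader" clustering: each arriving article joins the most similar existing cluster or starts a new one, and cluster size stands in for story popularity. This is a toy illustration, not the method described in the talk: the Jaccard word-overlap similarity, the 0.5 threshold, and the function names are assumptions, and the offline preclustering step is not shown.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(titles, threshold=0.5):
    """Assign each title to the closest cluster leader, or start a new cluster."""
    clusters = []  # each cluster: {"leader": word set of first member, "members": [...]}
    for title in titles:
        terms = tokens(title)
        best = max(clusters, key=lambda c: jaccard(terms, c["leader"]), default=None)
        if best is not None and jaccard(terms, best["leader"]) >= threshold:
            best["members"].append(title)
        else:
            clusters.append({"leader": terms, "members": [title]})
    # Larger clusters first: size is the popularity measure.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)


# Hypothetical headlines.
headlines = [
    "India wins cricket series against Australia",
    "India wins cricket series",
    "Venture capital funding rises",
    "India cricket series win against Australia",
]
stories = cluster_articles(headlines)
sizes = [len(story["members"]) for story in stories]
```

Here the three cricket headlines fall into one cluster and the venture capital headline forms its own, so `sizes` is `[3, 1]`.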

  20. Summary • Consumer: • The Internet has come a long way in terms of getting information to people • The utopian goal of a smart, chatty expert is still far away – is a great first step • Need good tools to keep on top of the information explosion – personalized newspaper ( is our first stab at this • Technology: • Need to deal with large volumes of data • Efficient data analysis and annotation (e.g., categorization) • A humming next-gen database system that grows incrementally, is immune to failures, and is expressive for non-programmers