This presentation discusses recent developments in automatic web resource discovery, focusing on topic distillation, hypertext classification, and focused crawling. It explores the challenges of scalable resource discovery and the limitations of traditional search engines, and highlights the need for personalized, efficient focused crawling systems.
Recent Results in Automatic Web Resource Discovery • Soumen Chakrabarti • Presentation by Cui Tao
Introduction • Classical IR: • Indexing a collection of documents • Answering queries by returning a ranked list of relevant documents • Problems in retrieving online documents: • Ambiguity • Context sensitivity • Synonymy • Polysemy • Large number of relevant Web pages
Introduction • Directory-based topic browsing: tree-like structure • Mostly maintained by human experts • Advantages: exemplary, influential • Disadvantages: slow, subjective, and noisy
Introduction • Standard crawlers and search engines: • 1997: covered 35-40% of 340 million Web pages • 1999: covered 18% of 800 million Web pages • Cannot be used for maintaining generic portals or for automatic resource discovery
Introduction • Focused crawler: • Selectively seeks out pages that are relevant to a pre-defined set of topics • Preferred by experts and researchers • Two modules: • Classifier: analyzes the text in, and the links around, a given web page and automatically assigns it to suitable directories in a web catalog • Distiller: identifies the centrality of crawled pages to determine visit priorities
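The crawl loop described on this slide can be sketched as a priority-driven traversal. This is a simplified illustration, not the system from the paper: `fetch`, `relevance`, the `min_relevance` cutoff, and the toy web are all hypothetical, and visit priority here comes only from the classifier score of the citing page (the distiller's centrality estimate is not modeled).

```python
import heapq

def focused_crawl(seeds, fetch, relevance, max_pages=10, min_relevance=0.5):
    """Visit pages in order of the relevance of the page that linked to them.

    fetch(url) -> (text, out_links) and relevance(text) -> float in [0, 1]
    are hypothetical stand-ins for a downloader and a topic classifier.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, out_links = fetch(url)
        visited += 1
        score = relevance(text)
        if score < min_relevance:
            continue  # off-topic page: do not follow its links
        relevant.append(url)
        for link in out_links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant

# Toy in-memory "web": url -> (page text, outgoing links).
toy_web = {
    "seed": ("cycling bike routes", ["bikes", "news"]),
    "bikes": ("mountain bike reviews", ["trails"]),
    "news": ("stock market news", ["finance"]),
    "trails": ("bike trails map", []),
    "finance": ("bond prices", []),
}

def toy_fetch(url):
    return toy_web[url]

def toy_relevance(text):
    # Crude keyword test standing in for a trained classifier.
    return 1.0 if "bike" in text.split() else 0.0

crawled = focused_crawl(["seed"], toy_fetch, toy_relevance)
```

In this toy run the off-topic "news" page is fetched but not expanded, so "finance" is never visited; that pruning is what distinguishes a focused crawl from a standard breadth-first one.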
Distillation techniques • Google: • Simulates a random walk on the Web • Pages are ranked by pre-computed popularity and visitation rate • Fast
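The random-walk ranking on this slide is commonly known as PageRank. A minimal power-iteration sketch follows; the damping factor of 0.85, the uniform handling of pages without out-links, and the toy graph are assumptions made for illustration and are not stated in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the visitation rate of a random surfer.

    links maps each page to the list of pages it links to.
    damping is the assumed probability of following a link rather than
    jumping to a uniformly random page.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

Because the scores depend only on the link graph and not on any query, they can be computed once offline, which is why the slide describes this approach as fast.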
Distillation techniques • HITS (Hyperlink-Induced Topic Search): • Depends on a search engine • Combines two scores: • Authorities: identify pages with useful information about a topic • Hubs: identify pages that contain many links to pages with useful information on the topic • Query-dependent and slow • May lead to topic contamination or drift
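The mutual reinforcement between hubs and authorities can be sketched as the iteration below. It assumes the query-specific subgraph has already been collected (the search-engine step is not shown), and the fixed iteration count and toy graph are illustrative choices.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority scores on a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

subgraph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub_scores, auth_scores = hits(subgraph)
```

Since the subgraph is rebuilt and the iteration rerun for every query, the method is query-dependent and slower than a precomputed ranking, and any off-topic pages pulled into the subgraph take part in the reinforcement, which is one way drift can arise.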
Distillation techniques • ARC and CLEVER: • ARC (Automatic Resource Compiler): part of CLEVER • Root set is expanded by 2 links instead of 1 link (including all pages within link distance two of at least one page in the root set) • Weights are assigned to hyperlinks based on the match between the query and the text surrounding the hyperlink in the source document
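One way to picture the hyperlink weighting is to count query terms in the text around each link and use the result to scale that link's contribution to authority scores. The sketch below assumes a weight of one plus the number of matching words; the exact formula, the window of text considered, and both function names are illustrative assumptions rather than details given in the slides.

```python
def anchor_weight(query, anchor_window):
    """Weight a link by query-term matches in the text around it.

    Assumed form: 1 + number of words in the window that are query terms.
    """
    terms = set(query.lower().split())
    words = anchor_window.lower().split()
    return 1 + sum(1 for w in words if w in terms)

def weighted_authority_step(edges, hub):
    """One authority update in which each link counts by its weight.

    edges maps (source, target) pairs to link weights;
    hub maps pages to their current hub scores (default 1.0).
    """
    auth = {}
    for (src, dst), weight in edges.items():
        auth[dst] = auth.get(dst, 0.0) + weight * hub.get(src, 1.0)
    return auth

query = "power supplies"
edges = {
    ("hub", "p1"): anchor_weight(query, "best switch mode power supplies here"),
    ("hub", "p2"): anchor_weight(query, "my holiday photos"),
}
auth = weighted_authority_step(edges, {"hub": 1.0})
```

A link whose surrounding text matches the query passes more score to its target than an unrelated link on the same page.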
Distillation techniques • Outlier filtering: • Computes relevance weights for pages using the vector space model • All pages whose weights fall below a threshold are pruned • Effectively prunes outlier nodes in the neighborhood, thus avoiding contamination
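The pruning step can be sketched with raw term-frequency vectors and cosine similarity. The reference text that stands for the topic, the threshold of 0.2, and the function names are assumptions for illustration; the slides do not say how the threshold or the reference vector is chosen.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prune_outliers(pages, topic_text, threshold=0.2):
    """Keep only pages whose similarity to the topic meets the threshold.

    pages maps a URL to its text; topic_text is an assumed reference
    description of the topic.
    """
    topic = tf_vector(topic_text)
    return {
        url: text
        for url, text in pages.items()
        if cosine(tf_vector(text), topic) >= threshold
    }

neighborhood = {
    "p1": "switch mode power supply design",
    "p2": "cheap flights and hotel deals",
}
kept = prune_outliers(neighborhood, "power supply switch mode")
```

Pages removed here never enter the link analysis, so they cannot pass score to other off-topic pages.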
Topic distillation vs. Resource discovery • Topic distillation: • Depends on large, comprehensive Web crawls and indices (post-processing) • Can it be used to generate a Web taxonomy? • Set a keyword query for each node in the taxonomy • Run a distillation program • Simple, but has some problems
Topic distillation vs. Resource discovery • Problems: • Constructing the query involves trial and error and careful thought • A simple query: “North American telecommunication companies” • A query for the Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel • Query complexity: • To match the directory-based browsing quality of Yahoo!: 7.03 terms and 4.34 operators on average • Typical Alta Vista query: 2.35 terms and 0.41 operators
Topic distillation vs. Resource discovery • Problems: • Contamination • Stop-sites: not automatic • Term weighting • Edge weighting: no precise algorithm to set the weights • Topic distillation by itself is not enough for resource discovery
Hypertext classification: learning from examples • Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result • The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification • Nearest-neighbor (NN), Bayesian, and support vector classifiers
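As one concrete instance of learning a topic boundary from example pages, here is a minimal multinomial naive Bayes text classifier with add-one smoothing. It uses page text only and ignores link structure; the training snippets, labels, and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Collect word and document counts from (text, label) examples."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("bike trails and cycling routes", "cycling"),
    ("mountain bike gear reviews", "cycling"),
    ("stock market and bond prices", "finance"),
    ("bank interest rates and bond yields", "finance"),
]
model = train_naive_bayes(examples)
```

Folding the text of each example's distance-1 neighbors into the training data would be one simple way to act on the slide's point that neighbors also carry topic information.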
Hypertext classification • Link-based features are important • Circular topic influence: • The topic of a page influences its text and the topics of its neighboring pages • Knowledge of the topics in the linked vicinity provides clues to the test document’s topic • Bibliometric: more general than the simple linear endorsement model used in topic distillation
Conclusion • Emphasized the importance of scalable automatic resource discovery • Argued that common search engines are not adequate for resource discovery • Introduced the recently invented focused crawling system
Future Work • How can training examples be derived automatically? • How can the outcome of a focused crawler be personalized for users?