Advances in Web Resource Discovery Techniques and Challenges Ahead

Recent Results in Automatic Web Resource Discovery • Soumen Chakrabarti • Presentation by Cui Tao

Introduction • Classical IR: • Indexing a collection of documents • Answering queries by returning a ranked list of relevant documents • Problems in retrieving online documents: • Ambiguity • Context sensitivity • Synonymy • Polysemy • Large number of relevant Web pages

Introduction • Directory-based topic browsing: tree-like structure • Mostly maintained by human experts • Advantages: exemplary, influential • Disadvantages: slow, subjective and noisy

Introduction • Standard crawlers and search engines • 1997: covered 35-40% of 340 million Web pages • 1999: covered 18% of 800 million Web pages • Cannot be used for maintaining generic portals and automatic resource discovery

Introduction • Focused crawler: • Selectively seeks out pages that are relevant to a pre-defined set of topics • Preferred by experts and researchers • Two modules: • Classifier: analyzes the text in and links around a given Web page and automatically assigns it to suitable directories in a Web catalog • Distiller: identifies the centrality of crawled pages to determine visit priorities
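The classifier-driven visit order described on this slide can be illustrated with a small sketch. It is not the system from the talk: the in-memory `links` graph, the `scores` table standing in for the classifier, the `threshold`, and the rule that unvisited links inherit the relevance of the page citing them are all assumptions made for illustration, and the distiller is left out entirely.

```python
import heapq

def focused_crawl(seeds, links, relevance, budget=10, threshold=0.5):
    """Visit pages in order of estimated relevance; expand only on-topic pages."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(url)  # hypothetical stand-in for the classifier
        if score < threshold:
            continue  # off-topic page: fetched, but its links are not followed
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                # Assumption: a link inherits the relevance of the citing page.
                heapq.heappush(frontier, (-score, target))
    return visited

# Toy Web graph and made-up relevance scores.
links = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7, "e": 0.9}
order = focused_crawl(["a"], links, scores.get)
```

In this toy run, page `e` is never fetched because it is reachable only through the off-topic page `c`.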

Distillation techniques • Google: • Simulates a random walk on the Web • Ranks pages by pre-computed popularity and visitation rate • Fast
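The random-walk idea on this slide is commonly formalized as PageRank. A minimal sketch by power iteration follows, under stated assumptions: the damping factor of 0.85, the fixed iteration count, and the uniform jump from pages without out-links are conventional choices, not details given in the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the visitation rate of a random surfer by power iteration."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the surfer follows one out-link chosen uniformly.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: from a page with no out-links, jump uniformly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Toy graph: both "a" and "b" link to "c", and "c" links back to "a".
rank = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Because the scores depend only on the link graph, they can be computed once ahead of time and reused for every query.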

Distillation techniques • HITS (Hyperlink Induced Topic Search): • Depends on a search engine • Combines two scores: • Authorities: identify pages with useful information about a topic • Hubs: identify pages that contain many links to pages with useful information on the topic • Query-dependent and slow • May lead to topic contamination or drift
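The mutual reinforcement between hubs and authorities can be sketched as the iteration below. The toy graph, the iteration count, and the Euclidean normalization are assumptions for illustration; in practice the graph would be built from a search engine's results for the query plus their link neighborhood.

```python
import math

def hits(links, iterations=20):
    """Iterate hub and authority scores on a small link graph."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph: three link pages pointing at two content pages.
hub, auth = hits({"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]})
```

Here `x` ends up as the top authority, and `h1` and `h2` score higher as hubs than `h3` because they point to more good authorities.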

Distillation techniques • ARC and CLEVER: • ARC (Automatic Resource Compiler): part of CLEVER • Root set is expanded by 2 links instead of 1 link (including all pages that are at link distance two or less from at least one page in the root set) • Assigns weights to the hyperlinks based on the match between the query and the text surrounding the hyperlink in the source document
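One simple way to turn "match between the query and the surrounding text" into a number is to count query-term occurrences in a window of text around the link. The sketch below assumes that scheme; the function name `anchor_weight`, the tokenization, and the choice of 1 plus the match count are illustrative assumptions, not the exact formula from the talk.

```python
import re

def anchor_weight(query, anchor_window):
    """Weight a hyperlink by query-term matches in the text around it."""
    terms = {t.lower() for t in query.split()}
    tokens = re.findall(r"[a-z0-9]+", anchor_window.lower())
    # Assumption: every link gets a base weight of 1, plus 1 per matching token.
    return 1 + sum(1 for tok in tokens if tok in terms)

w_on_topic = anchor_weight("power supplies", "Cheap power supplies and more power")
w_generic = anchor_weight("power supplies", "click here")
```

Such weights would replace the uniform edge weight of 1 when hub and authority scores are propagated along links.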

Distillation techniques • Outlier filtering: • Computes relevance weights for pages using the Vector Space Model • All pages whose weights are below a threshold are pruned • Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
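The pruning step can be sketched with raw term-frequency vectors and cosine similarity. This is a minimal illustration under assumptions: the reference text that pages are compared against, the whitespace tokenization, and the threshold value are all hypothetical choices.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a bag of lower-cased words."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_outliers(pages, reference_text, threshold=0.2):
    """Keep only pages whose relevance weight reaches the threshold."""
    ref = tf_vector(reference_text)
    return {url: text for url, text in pages.items()
            if cosine(tf_vector(text), ref) >= threshold}

pages = {
    "p1": "switch mode power supply design",
    "p2": "parcel delivery tracking service",
}
kept = prune_outliers(pages, "power supply")
```

Only the surviving pages would then take part in the hub and authority computation.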

Topic distillation vs. Resource discovery • Topic distillation: • Depends on large, comprehensive Web crawls and indices (post-processing) • Can it be used to generate a Web taxonomy? • Set a keyword query for each node in the taxonomy • Run a distillation program • Simple, but has some problems

Topic distillation vs. Resource discovery • Problems: • Constructing the query: involves trial, error and complicated thought • Query: “North American telecommunication companies” • Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel (for the Yahoo! node /Business&Economy /Companies /Electronics /PowerSupplies) • To match directory-based browsing quality: • Yahoo!: 7.03 terms and 4.34 operators • Alta Vista: 2.35 terms and 0.41 operators

Topic distillation vs. Resource discovery • Problems: • Contamination • Stop-sites: not automatic • Term weighting • Edge weighting: no precise algorithm to set the weights • Topic distillation by itself is not enough for resource discovery

Hypertext classification: learning from examples • Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result • The contents of the given examples and their neighbors provide a way to compute the classification decision boundary • NN, Bayesian and support vector classifiers
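Of the classifier families named on this slide, the Bayesian one is the easiest to sketch. The example below is a plain multinomial naive Bayes text classifier with add-one smoothing, trained on made-up example pages. It uses page text only and ignores link features, so it is a simplification rather than the classifier from the talk; the topics and training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Collect per-topic word counts from (topic, text) example pages."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, text in examples:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the topic with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in sorted(doc_counts):
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)  # topic prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[topic][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical training examples for two taxonomy nodes.
model = train_naive_bayes([
    ("cycling", "bike race mountain bike trail"),
    ("cycling", "road bike gear"),
    ("gardening", "rose garden soil"),
    ("gardening", "garden plant seeds soil"),
])
topic = classify(model, "mountain bike gear")
```

The example pages play the role of the user-supplied training set: adding or removing examples moves the decision boundary without any query engineering.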

Hypertext classification • Link-based features: important • Circular topic influence • The topic of one page influences its text and its neighboring pages' topics • Knowledge of the linked vicinity’s topics provides clues to the test document’s topic • Bibliometric, more general than the simple linear endorsement model used in topic distillation

Putting it together for resource discovery

Conclusion • Emphasized the importance of scalable automatic resource discovery • Argued that common search engines are not adequate for resource discovery • Introduced the recently invented focused crawling system

Future Work • How to derive the training examples automatically? • How to personalize the output of a focused crawler for users?
