A Quality Focused Crawler for Health Information Tim Tang

Outline • Overview • Contributions • Experiments and results • Issues for discussion • Future work • Questions & Suggestions?

Overview • Many people use the Internet to search for health information • But… health web pages may contain low quality information, and may lead to personal endangerment. (example) • It is important to find means to evaluate the quality of health websites and to provide high quality results in health search.

Motivation • Web users can search for health information using general engines or domain-specific engines like health portals • 79% of Web users in the U.S. search for health information on the Internet (Fox S. Health Info Online, 2005) • No technique is available for measuring the quality of Web health search results. • Also, there is no method for automatically enhancing the quality of health search results • Therefore, people building a high quality health portal have to do it manually and, without work on measurement, we can’t tell how good a job they are doing • An example of such a health portal is BluePages search, developed by the ANU’s Centre for Mental Health Research.

BluePages Search (BPS)

BPS result list

Research Objectives • To produce a health portal search that: • Is built automatically to save time, effort, and expert knowledge (cost saving). • Contains (only) high quality information in the index by applying some quality criteria • Satisfies users’ demand for getting good advice (evidence-based medicine) about specific health topics from the Internet

Contributions • New and effective quality indicators for health websites using some IR-related techniques • Techniques to automate the manual quality assessment of health websites • Techniques to automate the process of building high quality health search engines

Expt1: General vs. domain specific search engines • Aim: To compare the performance of general search engines (Google, GoogleD) and domain specific engines (BPS) for domain relevance and quality. • Details: 100 depression queries were run on these engines. The top 10 results for each query from each engine were evaluated. • Results: next slide.

Expt1: Results MAP = Modified Average Precision NDCG = Normalised Discounted Cumulative Gain
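As a rough illustration of the NDCG measure named on this slide, here is a minimal sketch. The slides do not say which gain values or discount were used, so the log2 discount and the choice to build the ideal ordering from the retrieved list itself are assumptions, and the function names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=10):
    """DCG of the top-k results divided by the DCG of the best possible ordering.

    Assumption: the ideal ordering is obtained by sorting the judged gains of
    this same result list; an evaluation could instead use all judged pages.
    """
    top = gains[:k]
    ideal = sorted(gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded judgments for one query's result list (not data from the experiment).
print(ndcg([1, 3, 0, 2]))
```

A list that is already in descending order of gain scores 1.0, and any ordering that puts weaker results first scores lower.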

Expt1: Findings • Findings: GoogleD retrieves more relevant pages, but fewer high quality pages, than BPS. Domain-specific engines (BPS) have poor coverage, which lowers their performance on relevance. • What next: How to improve coverage for domain-specific engines? How to automate the process of constructing a domain specific engine?

Expt2: Prospect of Focused Crawling in building domain-specific engines • Aim: To investigate the prospect of using focused crawling (FC) techniques to build health portals. In particular: • Seed list: BPS uses a seed list (start list for a crawl) that was manually selected by experts in the field. Can we automate this process? • Relevance of outgoing links: Is it feasible to follow outgoing links from the currently crawled pages to obtain more relevant links? • Link prediction: Can we successfully predict relevant links from available link information?

Expt2: Results & Findings • Out of 227 URLs from DMOZ, 186 were relevant (81%) => DMOZ provides a good starting list of URLs for a FC • An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages in a single step from the currently crawled pages. => Outgoing links from a constrained crawl lead to additional relevant content • The C4.5 decision tree learning algorithm can predict link relevance with a precision of 88.15% => A decision tree created using features like anchor text, URL words and link anchor context can help a focused crawler obtain new relevant pages

Expt3: Automatic evaluation of Websites • Aim: To investigate whether the Relevance Feedback (RF) technique can help in the automatic evaluation of health websites. • Details: RF is used to learn terms (words and phrases) representing high quality documents and their weights. This weighted query is then compared with the text of web pages to find the degree of similarity. We call this “Automatic quality tool” (AQT). • Findings: Significant correlation was found between human-rated (EBM) results and AQT results.
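The comparison of a weighted query with page text could be sketched as follows. The slides do not specify the similarity function, so cosine similarity over word and two-word phrase counts is an assumption here, and the query terms and weights are made up for illustration rather than learned by RF.

```python
import math
import re
from collections import Counter

def term_counts(text, max_n=2):
    """Count words and phrases of up to max_n consecutive words in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def aqt_score(weighted_query, page_text):
    """Cosine similarity between the weighted query and the page's term counts."""
    counts = term_counts(page_text)
    dot = sum(w * counts[t] for t, w in weighted_query.items())
    q_norm = math.sqrt(sum(w * w for w in weighted_query.values()))
    p_norm = math.sqrt(sum(c * c for c in counts.values()))
    if q_norm == 0 or p_norm == 0:
        return 0.0
    return dot / (q_norm * p_norm)

# Hypothetical weighted query; real terms and weights would come from RF.
query = {"evidence": 1.0, "clinical trial": 2.0}
print(aqt_score(query, "A clinical trial found evidence that the treatment works."))
print(aqt_score(query, "Buy this cure now."))
```

A site-level score could then be some aggregate of its page scores; how the AQT aggregates pages is not stated on the slide.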

Expt3: Results – Correlation between AQT score and EBM score

Expt3: Results – Correlation between Google PageRank and EBM score • Correlation: small and non-significant (r = 0.23, P = 0.22, n = 30) • Excluding sites with a PageRank of 0 gave a better correlation, but still significantly lower than the correlation between AQT and EBM.
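For readers who want to see how a correlation figure like the one above is obtained, here is a minimal sketch of Pearson's r and the t statistic usually used to test it. That Pearson's r was the coefficient used is an assumption; the slide only reports r, P and n.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Using the values reported on the slide: r = 0.23 with n = 30 sites.
print(t_statistic(0.23, 30))
```

With r = 0.23 and n = 30 the t statistic is about 1.25 on 28 degrees of freedom, which is in line with the non-significant P value reported.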

Expt4: Building a health portal using FC • Aim: To build a high-quality health portal automatically, using FC techniques • Details: • Relevance scores for links are predicted using the decision tree found in Expt. 2. Relevance scores are transformed into probability scores using the Laplace correction formula • We found that machine learning didn’t work well for predicting quality but RF helps. • The quality of a target page is predicted as the mean of the quality scores of all its known (visited) source pages • Combination of relevance and quality score: The product of the relevance score and the quality score is used to determine crawling priority
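The scoring steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: the Laplace correction is taken in its common two-class form (k + 1) / (n + 2) applied to decision tree leaf counts, quality scores are assumed to lie in [0, 1], and the `Frontier` class and the example URLs are hypothetical.

```python
import heapq

def laplace_probability(relevant_in_leaf, total_in_leaf, n_classes=2):
    """Laplace-corrected probability for a decision tree leaf (assumed form)."""
    return (relevant_in_leaf + 1) / (total_in_leaf + n_classes)

def predicted_quality(source_quality_scores):
    """Mean quality of the already-visited pages that link to the target."""
    return sum(source_quality_scores) / len(source_quality_scores)

def crawl_priority(relevance, quality):
    """Product of relevance and quality, as described on the slide."""
    return relevance * quality

class Frontier:
    """Hypothetical crawl frontier that pops the highest-priority URL first."""

    def __init__(self):
        self._heap = []

    def push(self, url, priority):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

frontier = Frontier()
# Link A: leaf with 9 relevant of 10 training links, but low-quality sources.
frontier.push("http://example.org/a",
              crawl_priority(laplace_probability(9, 10), predicted_quality([0.2, 0.4])))
# Link B: leaf with 6 relevant of 10, linked from high-quality sources.
frontier.push("http://example.org/b",
              crawl_priority(laplace_probability(6, 10), predicted_quality([0.9, 0.7])))
print(frontier.pop())
```

In this toy case link B is crawled first even though link A looks more relevant, because the quality of its source pages dominates the product.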

Expt4: Results – Quality scores 3 crawls were built: BF, Relevance and Quality

Expt4: Results – Below Average Quality (BAQ) pages in each crawl

Expt4: Findings • RF is a good technique to be used in predicting quality of web pages based on the quality of known source pages. • Quality is an important measure in health search because a lot of relevant information is of poor quality (e.g. the relevance crawler) • Further analysis shows that quality of content might be further improved by post-filtering a very big BF crawl but at the cost of substantially increased network traffic.

Issues for discussion • Combination of scores • Untrusted sites • Quality evaluation • Relevance threshold choice • Coverage • Combination of quality indicators • RF vs Machine learning

Issue: Combination of scores • The decision to multiply the relevance and quality scores was arbitrary; the idea was to keep a balance between relevance and quality, so that both quality and coverage are maintained. • Question: Would addition (or another linear combination) be a better way to calculate this score? Or should only the quality score be considered? In general, how should relevance and quality scores be combined?
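To make the question concrete, the sketch below ranks two hypothetical links under the product and under an equally weighted sum. The scores, the link names and the weight `alpha` are all made up for illustration.

```python
def product_score(relevance, quality):
    """Combination used in Expt4: relevance times quality."""
    return relevance * quality

def linear_score(relevance, quality, alpha=0.5):
    """Alternative: weighted sum, with alpha as a hypothetical tuning weight."""
    return alpha * relevance + (1 - alpha) * quality

def rank(links, score_fn):
    """Return link names sorted by descending combined score."""
    ordered = sorted(links, key=lambda link: score_fn(link[1], link[2]), reverse=True)
    return [name for name, _, _ in ordered]

# (name, relevance, quality): one unbalanced link and one balanced link.
links = [("link_a", 0.9, 0.1), ("link_b", 0.45, 0.45)]
print(rank(links, product_score))  # balanced link first
print(rank(links, linear_score))   # unbalanced link first
```

The product penalises a link that is strong on one score and weak on the other, whereas a sum lets a high relevance score compensate for low quality, so the two rules can order the same frontier differently.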

Issue: Untrusted sites • RF was used for predicting high quality, but … • Analysis showed that low quality health sites are often untrusted sites, such as commercial sites, chat sites, forums, bulletins and message boards. Our results don’t seem to exclude some of these sites. • Question: Is it feasible to use RF somehow, or any other means, to detect these sources? How should that be incorporated into the crawler?

Issue: Quality evaluation expt. • Expensive because manual evaluation for quality requires a lot of expert knowledge and effort. To know the quality of a site, we have to judge all the pages of that site. • Question: How to design a cheaper but effective evaluation experiment for quality? Can lay judgment for quality be used somehow?

Issue: Relevance threshold choice • A relevance classifier was built to help reduce the relevance judging effort. A cut-off point for the relevance score needs to be identified. The classifier runs on 2000 pre-judged documents, half of which are relevant. I chose as the cut-off the score at which the total number of false positives and false negatives is minimised. • Question: Is this a reasonable way to decide a relevance threshold? Any alternative?
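The threshold rule described above can be sketched as a simple search over candidate cut-offs. The scores and labels below are toy values, not the 2000 pre-judged documents, and treating a score at or above the cut-off as "relevant" is an assumption.

```python
def best_threshold(scores, labels):
    """Return (threshold, errors) minimising false positives plus false negatives.

    A document is predicted relevant when its score is >= threshold.
    Candidates are the observed scores plus infinity (predict nothing relevant).
    Ties are resolved in favour of the lowest threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < t and y)
        errors = false_pos + false_neg
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

# Toy classifier scores with human relevance judgments (True = relevant).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [False, False, True, True]
print(best_threshold(scores, labels))
```

Minimising the raw error count weights false positives and false negatives equally; a cost-weighted sum would be one alternative if the two kinds of error matter differently for judging effort.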

Issue: Coverage • The FC may not explore all directions of the Web, resulting in low coverage. It’s important to know how much of the Web’s high quality content the FC can index. • Question: How to design an experiment that evaluates coverage? (How to measure recall?)

Issue: Combination of quality indicators • Health experts have identified several quality indicators that may help in the evaluation of quality, such as content currency, authoring information, information about disclosure, etc. • Question: How can/should these indicators be used in my work to predict quality?

Issue: RF vs Machine Learning • Compared to RF, ML has the flexibility of adding more features such as ‘inherited quality score’ (from source pages) into the learning process to predict the quality of the results. • However, we initially tried ML to predict quality and found that RF is much better. Maybe because we didn’t do it right!? • Question: Could ML be used in a similar way to RF? Does it promise better results?

Future work • Better combination of quality and relevance scores to improve quality • Involve quality dimension in ranking of health search results (create something similar to BM25, with the incorporation of quality measure?) • Move to another topic in health domain or an entirely new topic? • Combine heuristics, other medical quality indicators with RF?

Suggestions • Any more suggestions to improve my work? • Any more suggestions for future work? • Other suggestions? The end!
