Iterative Readability Computation for Domain-Specific Resources

By Jin Zhao and Min-Yen Kan, 11/06/2010

Domain-Specific Resources
Domain-specific resources cater to a wide range of audiences.
• Wikipedia page on modular arithmetic
• Interactivate page on clocks and modular arithmetic
WING, NUS

Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?

Literature Review
• Heuristic Readability Measures
  • Weighted sum of textual feature values
  • Examples:
    • Flesch Kincaid Reading Ease (FKRE)
    • Dale-Chall
  • Quick and indicative but oversimplifying
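To make "weighted sum of textual feature values" concrete, here is a minimal sketch of the standard published Flesch Reading Ease formula. The slide's own formula is not preserved in this transcript, so treating this as the measure the slides call FKRE is an assumption. The function name and the choice to pass precomputed counts (rather than tokenize and count syllables) are also illustrative.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease score; higher means easier to read.

    Assumption: word, sentence and syllable counts are computed elsewhere.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # Weighted sum of two surface features plus a constant offset.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Example: 100 words, 5 sentences, 150 syllables.
example_score = flesch_reading_ease(100, 5, 150)
```

The formula only sees sentence length and syllable counts, which is why such measures are quick but blind to how difficult a domain concept is.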

Literature Review
• Natural Language Processing and Machine Learning Approaches
  • Extract deep text features and construct sophisticated models for prediction
  • Text Features
    • N-grams, height of parse tree, discourse relations
  • Models
    • Language Model, Naïve Bayes, Support Vector Machine
  • More accurate, but require an annotated corpus and ignore domain-specific concepts

Literature Review
• Domain-Specific Readability Measures
  • Derive information about domain-specific concepts from expert knowledge sources
  • Examples:
    • Wordlist
    • Ontology
  • Also improve performance, but knowledge sources are still expensive and not always available
Is it possible to measure readability for domain-specific resources without an expensive corpus or knowledge source?

Intuitions
• A domain-specific resource is less readable than another if the former contains more difficult concepts
• A domain-specific concept is more difficult than another if the former appears in less readable resources
• Use an iterative computation algorithm to estimate these two scores from each other
• Example:
  • Pythagorean theorem vs. ring theory

Algorithm
• Required Input
  • A collection of domain-specific resources (without annotation)
  • A list of domain-specific concepts
• Graph Construction
  • Construct a graph representing resources, concepts and occurrence information
• Score Computation
  • Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts

Graph Construction
• Preprocessing
  • Extraction of occurrence information
• Construction steps
  • Resource node creation
  • Concept node creation
  • Edge creation based on occurrence information
[Figure: example graph built from a concept list (Pythagorean Theorem, tangent, triangle, trigonometry, sine) and two resources, Resource 1 and Resource 2, with text snippets "Pythagorean Theorem … triangle … sine … tangent …" and "trigonometry … sine … tangent … triangle …"]
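The construction steps can be sketched as a small bipartite graph builder. This is an illustration under stated assumptions, not the authors' implementation: occurrence is detected by case-insensitive substring matching, the function name `build_occurrence_graph` is hypothetical, and the assignment of the two text snippets to Resource 1 and Resource 2 is a guess from the flattened slide figure.

```python
def build_occurrence_graph(resources: dict[str, str], concepts: list[str]):
    """Return (resource -> set of concepts, concept -> set of resources)."""
    # Resource node creation and concept node creation.
    resource_to_concepts = {name: set() for name in resources}
    concept_to_resources = {concept: set() for concept in concepts}
    # Edge creation based on occurrence information.
    for name, text in resources.items():
        lowered = text.lower()
        for concept in concepts:
            # Assumption: a concept occurs if its lowercased name is a substring.
            # A real system would need token-aware matching ("sine" vs "cosine").
            if concept.lower() in lowered:
                resource_to_concepts[name].add(concept)
                concept_to_resources[concept].add(name)
    return resource_to_concepts, concept_to_resources


concepts = ["Pythagorean Theorem", "tangent", "triangle", "trigonometry", "sine"]
resources = {
    # Assumed assignment of snippets to resources.
    "Resource 1": "Pythagorean Theorem ... triangle ... sine ... tangent ...",
    "Resource 2": "trigonometry ... sine ... tangent ... triangle ...",
}
r2c, c2r = build_occurrence_graph(resources, concepts)
```

Keeping both adjacency maps makes the later score updates symmetric: each node only needs the set of its neighbors on the other side of the graph.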

Score Computation
• Initialization
  • Resource Node (FKRE)
  • Concept Node (Average score of neighboring nodes)
• Iterative Computation
  • All nodes (Current score + average score of neighboring nodes)
• Termination Condition
  • The ranking of the resources stabilizes
[Figure: example graph with resource nodes w, x, y, z and concept nodes a, b, c]
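A minimal sketch of the initialization, update and termination rules above follows. Several details are assumptions because the slide does not specify them: all scores live on one FKRE-like scale (higher means easier, so a low concept score means a difficult concept), updates are synchronous, and the sum "current score + average of neighbors" is halved to keep values in range (this rescaling does not change rankings). The names `iterate_scores`, `ranking` and `mean` are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ranking(scores):
    # Names sorted by descending score; ties broken by name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


def iterate_scores(initial_resource_scores, edges, max_iterations=100):
    """edges: iterable of (resource, concept) occurrence pairs."""
    res_nbrs = {r: set() for r in initial_resource_scores}
    con_nbrs = {}
    for resource, concept in edges:
        res_nbrs[resource].add(concept)
        con_nbrs.setdefault(concept, set()).add(resource)

    # Initialization: resources start from FKRE, concepts from their neighbors' average.
    res = dict(initial_resource_scores)
    con = {c: mean(res[r] for r in nbrs) for c, nbrs in con_nbrs.items()}

    previous = ranking(res)
    iterations_run = 0
    for _ in range(max_iterations):
        # Synchronous update: (current score + average neighbor score) / 2.
        new_res = {r: (res[r] + mean(con[c] for c in nbrs)) / 2 if nbrs else res[r]
                   for r, nbrs in res_nbrs.items()}
        new_con = {c: (con[c] + mean(res[r] for r in nbrs)) / 2
                   for c, nbrs in con_nbrs.items()}
        res, con = new_res, new_con
        iterations_run += 1
        current = ranking(res)
        if current == previous:  # termination: resource ranking has stabilized
            break
        previous = current
    return res, con, iterations_run


fkre = {"r1": 30.0, "r2": 80.0}  # made-up initial scores for illustration
edges = [("r1", "ring theory"), ("r1", "triangle"),
         ("r2", "triangle"), ("r2", "Pythagorean theorem")]
res, con, n = iterate_scores(fkre, edges)
```

In this toy input, "ring theory" appears only in the harder resource and ends up with the lowest concept score, matching the Pythagorean theorem vs. ring theory intuition.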

Evaluation
• Goals
  • Effectiveness
    • Iterative computation vs. other readability measures in math domain
  • Efficiency
    • Iterative computation with domain-specific resource and concept selection in math domain
  • Portability
    • Iterative computation vs. other readability measures in medical domain

Effectiveness Experiment
• Corpus
  • Collection
    • 27 math concepts
    • First 100 search results from Google
  • Annotation
    • 120 randomly chosen webpages
    • Annotated by first author and 30 undergraduate students using a 7-point readability scale
    • Kappa: 0.71, Spearman’s rho: 0.93

Effectiveness Experiment
• Baselines:
  • Heuristic
    • FKRE
  • Supervised learning
    • Naïve Bayes, Support Vector Machine, Maximum Entropy
    • Binary word features only
• Metrics:
  • Pairwise accuracy
  • Spearman’s rho
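The two metrics can be sketched in plain Python. The slides do not define pairwise accuracy, so the definition used here is an assumption: the fraction of resource pairs with different gold labels whose predicted order matches the gold order. Spearman's rho is computed as the Pearson correlation of average ranks, which handles ties. Function names are illustrative.

```python
from itertools import combinations


def pairwise_accuracy(gold, predicted):
    agree = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # skip pairs the annotation does not order
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


def average_ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


gold = [1, 2, 3, 4]            # made-up annotated readability levels
predicted = [10, 20, 40, 30]   # made-up system scores with one swapped pair
acc = pairwise_accuracy(gold, predicted)
rho = spearman_rho(gold, predicted)
```

Both metrics depend only on ordering, so they can compare systems whose raw scores live on different scales.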

Effectiveness Experiment
• Results
  • FKRE and NB show modest correlation
  • SVM and Maxent perform significantly better
  • Best performance is achieved by iterative computation

Efficiency Experiment
• Corpus/Metrics same as before
• Different selection strategies
  • Resource selection at random
  • Resource selection by quality
  • Concept selection at random
  • Concept selection by TF.IDF
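Concept selection by TF.IDF could look like the sketch below. The slides do not say which TF.IDF variant was used, so this is one plausible choice: a concept's score is its total count across the collection multiplied by log(N / document frequency). The function name, the substring counting and the sample texts are all assumptions for illustration.

```python
import math


def select_concepts_by_tfidf(resources, concepts, top_k):
    """resources: list of texts. Returns (top_k concepts, all scores)."""
    n_docs = len(resources)
    scores = {}
    for concept in concepts:
        needle = concept.lower()
        # Assumption: occurrences are counted by lowercased substring matching.
        counts = [text.lower().count(needle) for text in resources]
        doc_freq = sum(1 for c in counts if c > 0)
        if doc_freq == 0:
            scores[concept] = 0.0
            continue
        idf = math.log(n_docs / doc_freq)
        scores[concept] = sum(counts) * idf
    ranked = sorted(concepts, key=lambda c: (-scores[c], c))
    return ranked[:top_k], scores


texts = [
    "a number and a modulus; the modulus is a number",
    "a prime number",
    "a number line",
]
selected, tfidf = select_concepts_by_tfidf(texts, ["number", "modulus", "prime"], top_k=2)
```

A concept that appears in every resource gets an inverse document frequency of zero, so it is filtered out: it cannot help distinguish easy resources from hard ones.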

Efficiency Experiment
• Results
  • If chosen at random, the more resources/concepts the better
  • When chosen by quality, a small set of resources is also sufficient
  • Selection by TF.IDF helps to filter out useless concepts

Portability Experiment
• Corpus
  • Collection
    • 27 medical concepts
    • First 100 search results from Google
  • Annotation
    • Readability of 946 randomly chosen webpages annotated by first author on the same readability scale
• Metric/Baseline same as before

Portability Experiment
• Results
  • Heuristic is still the weakest
  • Supervised approaches benefit greatly from the larger amount of annotation
  • Iterative computation remains competitive
  • Limited readability spectrum in medical domain

Future Work
• Processing
  • Noise reduction
• Probabilistic formulation
  • Distribution of values
    • e.g. 70% of webpages highly readable and 30% much less readable
  • Correlations between multiple pairs of attributes
    • e.g. Genericity and page type

Conclusion
• Iterative Computation
  • Readability of domain-specific resources and difficulty of domain-specific concepts can be estimated from each other
  • Simple yet effective, efficient and portable
• Part of the exploration in Domain-specific Information Retrieval
  • Categorization
  • Readability
  • Text to domain-specific construct linking

Any questions?

Related Graph-based Algorithms
• PageRank
  • Directed links
  • Backlinks indicate popularity/recommendation
• HITS
  • Hub and authority score for each node
• SALSA
