Domain-Specific Iterative Readability Computation
This study presents a framework for measuring the readability of domain-specific resources through an iterative computation algorithm. By constructing a graph of resources and concepts, the approach computes the readability scores of various resources and the difficulty levels of their associated concepts. It evaluates heuristic and probabilistic methods to improve accuracy and effectiveness in different domains, such as healthcare and education. The outcomes are aimed at enhancing the accessibility of specialized content to diverse audiences, facilitating better information dissemination.
Presentation Transcript
Domain-Specific Iterative Readability Computation Jin Zhao 13/05/2011
Domain-Specific Resources WING, NUS
Domain-Specific Resources
Domain-specific resources target varying audiences.
Examples: the modular arithmetic page from Wikipedia and the modular arithmetic page from Interactivate.com
Challenge for a Domain-Specific Search Engine
How to measure readability for domain-specific resources?
Literature Review
• Heuristic-based Readability Measures
  • Weighted sum of text feature values
  • Examples:
    • Flesch Kincaid Reading Ease (FKRE): [Flesch48]
    • Dale-Chall readability formula: [Dale&Chall48]
Quick and indicative, but they often oversimplify
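As an illustration of the weighted-sum form these measures take, the sketch below computes the Flesch Reading Ease score from [Flesch48], which is what the FKRE label on this slide appears to refer to (an assumption; the slide does not spell out the formula). The function name and the choice to pass pre-computed counts are illustrative, and syllable counting is left out.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    # Weighted sum of two surface features plus a constant offset.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a short passage.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 3))
```

Both features are surface statistics, which is why such a measure cannot tell an easy domain concept from a hard one of similar length.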
Literature Review
• Natural Language Processing and Machine Learning Approaches
  • Extract deep text features and use supervised learning methods to generate models for readability measurement
  • Text features
    • Unigram [Collins-Thompson04], parse tree height [Schwarm05], discourse relations [Pitler08]
  • Supervised learning techniques
    • Support Vector Machine (SVM) [Schwarm05], k-Nearest Neighbor (KNN) [Heilman07]
More accurate, but an annotated corpus is required and the models ignore domain-specific concepts
Literature Review
• Domain-Specific Readability Measures
  • Derive information about domain-specific concepts from expert knowledge sources
  • Examples:
    • Open Access and Collaborative Consumer Health Vocabulary [Kim07]
    • Medical Subject Headings ontology [Yan06]
  • Handle domain-specific concepts, but expert knowledge sources are still expensive and not always available
Key qualities of a good readability measure: effective, portable and domain-aware.
Intuitions
• A domain-specific resource is less readable if it contains more difficult concepts
• A domain-specific concept is more difficult if it appears in less readable resources
• Use an iterative computation algorithm to estimate these two scores from each other
• Example: Pythagorean theorem vs. ring theory
Iterative Computation (IC) Algorithm
• Graph Construction
  • Construct a graph representing resources, concepts and occurrence information
• Score Computation
  • Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts
  • Two versions: heuristic and probabilistic
• Required Input
  • A collection of domain-specific resources
  • A list of domain-specific concepts
Graph Construction
• Concept list: …, right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function, …
• Resource 1: …Pythagorean theorem can be written as a² + b² = c², where c represents the length of the hypotenuse…
• Resource 2: …The sine function (sin) can be defined as the ratio of the side opposite the angle to the hypotenuse…
[Figure: graph linking the resource nodes (Resource 1, Resource 2) to the concept nodes (right triangle, Pythagorean theorem, hypotenuse, sine function, cosine function)]
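The graph construction step can be sketched as a mapping from each resource to the listed concepts that occur in its text. This is a minimal sketch, assuming case-insensitive whole-phrase matching; the slides do not say how occurrences are detected, and `build_graph` is a hypothetical name. Only the two excerpts shown on the slide are used as resource text.

```python
import re


def build_graph(resources, concepts):
    """Map each resource name to the set of listed concepts its text mentions."""
    graph = {}
    for name, text in resources.items():
        graph[name] = {
            concept
            for concept in concepts
            # Word boundaries keep "sine function" from matching inside "cosine function".
            if re.search(r"\b" + re.escape(concept) + r"\b", text, re.IGNORECASE)
        }
    return graph


concepts = ["right triangle", "Pythagorean theorem", "hypotenuse",
            "sine function", "cosine function"]
resources = {
    "Resource 1": "Pythagorean theorem can be written as a2 + b2 = c2, "
                  "where c represents the length of the hypotenuse",
    "Resource 2": "The sine function (sin) can be defined as the ratio of "
                  "the side opposite the angle to the hypotenuse",
}
graph = build_graph(resources, concepts)
for name in sorted(graph):
    print(name, sorted(graph[name]))
```

Each (resource, concept) pair in the result corresponds to one edge of the bipartite graph; the shared concept "hypotenuse" is what connects the two resources.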
Score Computation (Heuristic)
• Initialization
  • Resource node: FKRE
  • Concept node: average score of its adjacent nodes
• Iterative computation
  • Each node: original score + average of the original scores of its adjacent nodes
[Figure: graph of resource nodes w, x, y, z and concept nodes a, b, c. Initialization: resource scores 2.00, 4.00, 3.00, 1.00; concept scores 2.00, 2.50, 3.00. Iteration 1: resource scores 3.00, 5.25, 4.75, 7.00; concept scores 4.00, 6.00, 5.00.]
Score Computation (Heuristic)
• Termination condition
  • The rank order of the resource nodes stabilizes
[Figure: the same graph after further iterations. Iteration 2: resource scores 9.75, 10.25, 13.00, 7.00; concept scores 8.13, 10.00, 11.88. Iteration 3: resource scores 15.13, 18.82, 21.19, 24.88; concept scores 23.51, 20.00, 16.51.]
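The heuristic update and its termination test can be sketched as follows. The edge list and the assignment of scores to the labels w, x, y, z are assumptions, chosen so that the computed values line up with the numbers on the slides; the transcript does not preserve the actual figure. Comparing rankings from iteration 1 onward is also an assumption about how "stabilizes" is checked, and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical example graph: each resource maps to the concepts it contains.
EDGES = {"w": ["a"], "x": ["b", "c"], "y": ["a", "b"], "z": ["c"]}
# Hypothetical initial resource scores (the slides initialize these with FKRE).
RESOURCE_INIT = {"w": 1.0, "x": 2.0, "y": 3.0, "z": 4.0}


def adjacency(edges):
    """Build an undirected adjacency map over resource and concept nodes."""
    adj = {resource: list(concepts) for resource, concepts in edges.items()}
    for resource, concepts in edges.items():
        for concept in concepts:
            adj.setdefault(concept, []).append(resource)
    return adj


def initialize(adj, resource_init):
    """Resources keep their given score; concepts take the mean of their neighbours."""
    scores = dict(resource_init)
    for node, neighbours in adj.items():
        if node not in resource_init:
            scores[node] = mean(resource_init[n] for n in neighbours)
    return scores


def step(adj, scores):
    """New score = previous score + mean of the neighbours' previous scores."""
    return {node: scores[node] + mean(scores[n] for n in adj[node]) for node in adj}


def resource_ranking(scores, resources):
    """Resource names ordered by ascending score."""
    return sorted(resources, key=lambda r: scores[r])


def run(edges, resource_init, max_iterations=100):
    """Iterate until the resource ranking is unchanged between two iterations."""
    adj = adjacency(edges)
    scores = step(adj, initialize(adj, resource_init))  # iteration 1
    for iteration in range(2, max_iterations + 1):
        updated = step(adj, scores)
        stable = resource_ranking(updated, edges) == resource_ranking(scores, edges)
        scores = updated
        if stable:
            return scores, iteration
    return scores, max_iterations


final_scores, iterations = run(EDGES, RESOURCE_INIT)
print(iterations, resource_ranking(final_scores, EDGES))
```

Under these assumptions the loop stops at iteration 3, and the scores agree with the slide values to within about 0.01; the small differences are consistent with the slide figures having been rounded at each iteration.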
Score Computation (Heuristic)
• Single-valued score for each node
  • Unable to handle concepts of varying difficulties
• Simple averaging in score computation
  • Difficult to incorporate sophisticated computational mechanisms
Score Computation (Probabilistic)
• Initialization
  • Resource node: sentence sampling
  • Concept node: resource sampling
[Figure: graph of resource nodes w, x, y, z and concept nodes a, b, c at initialization]
Score Computation (Probabilistic)
• Iterative computation
  • Modified Naïve Bayes classification
[Equations: original Naïve Bayes classification, direct adaptation, and modified version, applied over the resource nodes and concept nodes]
Evaluation
• Key qualities of a good readability measure
  • Effectiveness
  • Portability
  • Domain-awareness
Effectiveness
• Corpus of math webpages
• Metrics:
  • Pairwise accuracy
  • Spearman’s rho
• Baselines:
  • Heuristic
    • FKRE
  • Supervised learning
    • NB, SVM, MaxEnt
    • Binary concept features only
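The two metrics can be sketched in a few lines. The slides do not define pairwise accuracy, so the version below is one common reading (an assumption): the fraction of resource pairs with different gold levels that the predicted scores put in the same order. Spearman's rho is computed as the Pearson correlation of the ranks, with ties given their average rank. The gold levels and predicted scores in the example are hypothetical.

```python
from itertools import combinations


def pairwise_accuracy(gold, predicted):
    """Share of pairs with different gold levels whose predicted order agrees."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # pairs tied in the gold standard are not scored
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


gold = [1, 2, 3, 4]                      # hypothetical annotated levels
predicted = [15.1, 21.2, 18.8, 24.9]     # hypothetical computed scores
print(pairwise_accuracy(gold, predicted), spearman_rho(gold, predicted))
```

Both metrics depend only on the ordering of the scores, which suits a measure whose raw values have no fixed scale.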
Portability
• Different selection strategies
  • Resource selection at random
  • Concept selection at random
  • Resource selection by quality
  • Concept selection by TF.IDF
• Performance measurement at 5 levels
  • 20%, 40%, 60%, 80% and 100% of the original resource collection / concept list
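The random selection strategy at the five levels can be sketched as below. This is a minimal sketch, assuming each level is sampled independently and without replacement; the slides do not say whether the smaller subsets are nested inside the larger ones, and `random_subsets` is a hypothetical name. The same function applies to a resource collection or a concept list.

```python
import random

LEVELS = (20, 40, 60, 80, 100)  # percentages listed on the slide


def random_subsets(items, levels=LEVELS, seed=0):
    """Return {level: random subset holding that percentage of the items}."""
    rng = random.Random(seed)  # fixed seed so an experiment can be repeated
    items = list(items)
    return {level: rng.sample(items, len(items) * level // 100) for level in levels}


# Hypothetical concept list of ten entries.
concept_list = [f"concept-{i}" for i in range(10)]
subsets = random_subsets(concept_list)
for level, subset in subsets.items():
    print(level, len(subset))
```

Running the readability computation on each subset and recording the metric at every level then shows how performance changes as less input is available.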
Portability
[Figure: results for the concept selection strategies and the resource selection strategies]
Portability
Domain-awareness
• Handling of domain-specific concepts
  • Simple yet effective
• Concepts of multiple difficulty levels?
  • Scores converge to a single value even in the probabilistic version (PIC)
  • Splitting? (K-Means, GMM, etc.)
  • Other computational mechanisms?
Conclusion
• Iterative Computation
  • Estimate the readability of domain-specific resources and the difficulty of domain-specific concepts in an iterative manner
  • Effective, portable and domain-aware
• Future Work
  • Handling of concepts of multiple difficulty levels