
Iterative Readability Computation for Domain-Specific Resources

By Jin Zhao and Min-Yen Kan, 11/06/2010. Domain-specific resources cater for a wide range of audiences, for example the Wikipedia page on modular arithmetic and the Interactivate page on clocks and modular arithmetic.


Presentation Transcript


  1. Iterative Readability Computation for Domain-Specific Resources
     By Jin Zhao and Min-Yen Kan, 11/06/2010

  2. Domain-Specific Resources
     Domain-specific resources cater for a wide range of audiences.
     Examples: Wikipedia page on modular arithmetic; Interactivate page on clocks and modular arithmetic
     WING, NUS

  3. Challenge for a Domain-Specific Search Engine
     How to measure readability for domain-specific resources?

  4. Literature Review
     • Heuristic Readability Measures
       • Weighted sum of textual feature values
       • Examples: Flesch Kincaid Reading Ease, Dale-Chall
       • Quick and indicative but oversimplifying
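
As an illustration of a "weighted sum of textual feature values", here is a minimal sketch of the standard Flesch Reading Ease formula, assuming that is the measure the slide abbreviates as FKRE. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple regular expression; both are assumptions made for this sketch, not part of the presentation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups (heuristic assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Weighted sum of two surface features, with the standard coefficients.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = flesch_reading_ease("The cat sat.")
hard = flesch_reading_ease("Modular arithmetic generalizes congruence relations.")
print(easy > hard)
```

The score depends only on sentence length and word length, which is why such measures are quick but blind to how difficult a domain concept actually is.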

  5. Literature Review
     • Natural Language Processing and Machine Learning Approaches
       • Extract deep text features and construct sophisticated models for prediction
       • Text features: n-grams, height of parse tree, discourse relations
       • Models: language model, Naïve Bayes, Support Vector Machine
       • More accurate, but require an annotated corpus and ignore domain-specific concepts

  6. Literature Review
     • Domain-Specific Readability Measures
       • Derive information about domain-specific concepts from expert knowledge sources
       • Examples: wordlist, ontology
       • Also improve performance, but knowledge sources are still expensive and not always available
     Is it possible to measure readability for domain-specific resources without an expensive corpus or knowledge source?

  7. Intuitions
     • A domain-specific resource is less readable than another if the former contains more difficult concepts
     • A domain-specific concept is more difficult than another if the former appears in less readable resources
     • Use an iterative computation algorithm to estimate these two scores from each other
     • Example: Pythagorean theorem vs. ring theory

  8. Algorithm
     • Required Input
       • A collection of domain-specific resources (without annotation)
       • A list of domain-specific concepts
     • Graph Construction
       • Construct a graph representing resources, concepts and occurrence information
     • Score Computation
       • Initialize and iteratively compute the readability score of domain-specific resources and the difficulty score of domain-specific concepts

  9. Graph Construction
     • Preprocessing
       • Extraction of occurrence information
     • Construction steps
       • Resource node creation
       • Concept node creation
       • Edge creation based on occurrence information
     [Figure: example graph linking Resource 1 and Resource 2 to the entries of the concept list (Pythagorean Theorem, tangent, triangle, trigonometry, sine) that occur in each resource's text]
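
The construction steps above can be sketched as a bipartite occurrence graph. The function name `build_occurrence_graph`, the case-insensitive substring matching, and the two toy resource texts are assumptions for illustration; the slides do not specify how occurrences are extracted.

```python
def build_occurrence_graph(resources: dict[str, str], concepts: list[str]):
    """Return (resource -> concepts, concept -> resources) adjacency maps.

    An edge exists when the concept string occurs in the resource text.
    Substring matching is a simplifying assumption of this sketch.
    """
    # Node creation: one node per resource and one per concept.
    resource_to_concepts = {rid: set() for rid in resources}
    concept_to_resources = {c: set() for c in concepts}
    # Edge creation based on occurrence information.
    for rid, text in resources.items():
        lowered = text.lower()
        for concept in concepts:
            if concept.lower() in lowered:
                resource_to_concepts[rid].add(concept)
                concept_to_resources[concept].add(rid)
    return resource_to_concepts, concept_to_resources


# Toy data loosely modeled on the slide's example (texts are invented).
concept_list = ["Pythagorean Theorem", "tangent", "triangle", "trigonometry", "sine"]
toy_resources = {
    "Resource 1": "The Pythagorean theorem relates the sides of a right triangle.",
    "Resource 2": "In trigonometry, sine and tangent are ratios defined on a triangle.",
}
r2c, c2r = build_occurrence_graph(toy_resources, concept_list)
print(sorted(c2r["triangle"]))
```

A production version would need tokenization or phrase matching so that, for instance, "sine" does not match inside an unrelated longer word.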

  10. Score Computation
     • Initialization
       • Resource node: FKRE
       • Concept node: average score of neighboring nodes
     • Iterative Computation
       • All nodes: current score + average score of neighboring nodes
     • Termination Condition
       • The ranking of the resources stabilizes
     [Figure: resource nodes w, x, y, z linked to concept nodes a, b, c]
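
A minimal sketch of the update rule as the slide states it, assuming synchronous updates, a tie-break on node name when ranking, and an iteration cap (`max_iter`); none of these details are given in the presentation. Scores are kept on the initial readability scale, so a higher concept score here means an easier concept.

```python
from statistics import mean


def iterate_scores(resource_to_concepts, initial_resource_scores, max_iter=100):
    """Propagate scores between resource and concept nodes until the
    resource ranking stops changing (or max_iter is reached)."""
    # Invert the adjacency map to get the neighbors of each concept node.
    concept_to_resources = {}
    for rid, concepts in resource_to_concepts.items():
        for c in concepts:
            concept_to_resources.setdefault(c, set()).add(rid)

    # Initialization: resources start from a heuristic score (e.g. FKRE),
    # concepts from the average score of their neighboring resources.
    res = dict(initial_resource_scores)
    con = {c: mean(res[r] for r in rs) for c, rs in concept_to_resources.items()}

    def ranking(scores):
        return sorted(scores, key=lambda node: (scores[node], node))

    previous = ranking(res)
    for _ in range(max_iter):
        # Every node: current score + average score of its neighbors.
        new_res = {
            r: res[r] + (mean(con[c] for c in cs) if cs else 0.0)
            for r, cs in resource_to_concepts.items()
        }
        new_con = {
            c: con[c] + mean(res[r] for r in rs)
            for c, rs in concept_to_resources.items()
        }
        res, con = new_res, new_con
        current = ranking(res)
        if current == previous:  # termination: resource ranking is stable
            break
        previous = current
    return res, con


graph = {"r1": {"triangle"}, "r2": {"triangle", "ring"}, "r3": {"ring"}}
start = {"r1": 80.0, "r2": 50.0, "r3": 20.0}
resource_scores, concept_scores = iterate_scores(graph, start)
print(concept_scores["triangle"] > concept_scores["ring"])
```

Because scores only accumulate, their absolute values grow with each iteration; only the resulting ranking is meaningful in this sketch.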

  11. Evaluation
     • Goals
       • Effectiveness: iterative computation vs. other readability measures in the math domain
       • Efficiency: iterative computation with selection of domain-specific resources and concepts in the math domain
       • Portability: iterative computation vs. other readability measures in the medical domain

  12. Effectiveness Experiment
     • Corpus
       • Collection
         • 27 math concepts
         • First 100 search results from Google
       • Annotation
         • 120 randomly chosen webpages
         • Annotated by first author and 30 undergraduate students using a 7-point readability scale
         • Kappa: 0.71, Spearman’s rho: 0.93

  13. Effectiveness Experiment
     • Baselines
       • Heuristic: FKRE
       • Supervised learning: Naïve Bayes, Support Vector Machine, Maximum Entropy
         • Binary word features only
     • Metrics
       • Pairwise accuracy
       • Spearman’s rho
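
The two metrics can be sketched as follows. The exact definitions used in the experiments are not given in the slides, so this assumes that pairwise accuracy is the fraction of item pairs with different gold labels that the predicted scores order the same way, and that Spearman's rho is computed as the Pearson correlation of average ranks (which handles ties).

```python
from itertools import combinations


def pairwise_accuracy(gold, predicted):
    """Fraction of pairs with distinct gold labels ordered correctly."""
    correct = total = 0
    for i, j in combinations(range(len(gold)), 2):
        if gold[i] == gold[j]:
            continue  # assumption: pairs tied in the gold labels are skipped
        total += 1
        if (gold[i] - gold[j]) * (predicted[i] - predicted[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


gold_labels = [1, 2, 3]
system_scores = [1.0, 3.0, 2.0]
print(pairwise_accuracy(gold_labels, system_scores), spearman_rho(gold_labels, system_scores))
```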

  14. Effectiveness Experiment
     • Results
       • FKRE and NB show modest correlation
       • SVM and Maxent perform significantly better
       • Best performance is achieved by iterative computation

  15. Efficiency Experiment
     • Corpus and metrics same as before
     • Different selection strategies
       • Resource selection at random
       • Resource selection by quality
       • Concept selection at random
       • Concept selection by TF.IDF
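
One way concept selection by TF.IDF could look, as a sketch only: the slides do not say how TF.IDF is computed or aggregated, so this assumes term frequency is the total number of occurrences across the collection, inverse document frequency is log(N / df), and the top-scoring concepts are kept. The function name and toy documents are hypothetical.

```python
import math


def select_concepts_by_tfidf(documents, concepts, top_k):
    """Rank concepts by collection-level TF * IDF and keep the top_k."""
    n_docs = len(documents)
    lowered = [d.lower() for d in documents]
    scores = {}
    for concept in concepts:
        key = concept.lower()
        counts = [doc.count(key) for doc in lowered]
        tf = sum(counts)                      # occurrences in the whole collection
        df = sum(1 for c in counts if c > 0)  # documents containing the concept
        idf = math.log(n_docs / df) if df else 0.0
        scores[concept] = tf * idf
    ranked = sorted(concepts, key=lambda c: (-scores[c], c))
    return ranked[:top_k], scores


docs = ["triangle triangle sine", "triangle page", "triangle ring ring ring"]
selected, tfidf = select_concepts_by_tfidf(docs, ["triangle", "sine", "ring"], top_k=2)
print(selected)
```

Under this definition a concept that appears in every resource gets a score of zero, since it cannot help separate more readable resources from less readable ones.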

  16. Efficiency Experiment
     • Results
       • If chosen at random, the more resources/concepts the better
       • When chosen by quality, a small set of resources is also sufficient
       • Selection by TF.IDF helps to filter out useless concepts

  17. Portability Experiment
     • Corpus
       • Collection
         • 27 medical concepts
         • First 100 search results from Google
       • Annotation
         • Readability of 946 randomly chosen webpages annotated by first author on the same readability scale
     • Metrics and baselines same as before

  18. Portability Experiment
     • Results
       • Heuristic is still the weakest
       • Supervised approaches benefit greatly from the larger amount of annotation
       • Iterative computation remains competitive
       • Limited readability spectrum in medical domain

  19. Future Work
     • Processing
       • Noise reduction
     • Probabilistic formulation
       • Distribution of values
         • e.g. 70% of webpages highly readable and 30% much less readable
       • Correlations between multiple pairs of attributes
         • e.g. genericity and page type

  20. Conclusion
     • Iterative Computation
       • Readability of domain-specific resources and difficulty of domain-specific concepts can be estimated from each other
       • Simple yet effective, efficient and portable
     • Part of the exploration in Domain-specific Information Retrieval
       • Categorization
       • Readability
       • Text to domain-specific construct linking

  21. Any questions?

  22. Related Graph-based Algorithms
     • PageRank
       • Directed links
       • Backlinks indicate popularity/recommendation
     • HITS
       • Hub and authority score for each node
     • SALSA
