A Comparison of Features for Automatic Readability Assessment

A Comparison of Features for Automatic Readability Assessment LijunFeng1 Matt Huenerfauth1 Martin Jansche2No´emieElhadad3 1City University of New York 2Google, Inc. 3Columbia University Coling 2010: Poster Volume, pages 276–284, Beijing, August 2010 報告者：劉憶年 2014/4/21

Outline • Introduction • Corpus • Features • Experiments and Discussion • Conclusions

Motivation and Method • Readability Assessment quantifies the difficulty with which a reader understands a text. Automatic readability assessment enables the selection of appropriate reading material for readers of varying proficiency. • We use grade levels, which indicate the number of years of education required to completely understand a text, as a proxy for reading difficulty. • We treat readability assessment as a classification task and evaluate trained classifiers in terms of their prediction accuracy.

RelatedWork • Many traditional readability metrics are linear models with a few (often two or three) predictor variables based on superficial properties of words, sentences, and documents. • These traditional metrics are easy to compute and use, but they are not reliable, as demonstrated by several recent studies in the field. • With the advancement of natural language processing tools, a wide range of more complex text properties have been explored at various linguistic levels. • In addition to lexical and syntactic features, several researchers started to explore discourse level features and examine their usefulness in predicting text readability.

Corpus • We contacted the Weekly Reader1 corporation, an on-line publisher producing magazines for elementary and high school students, and were granted access in October 2008 to an archive of their articles. • While pre-processing the texts, we found that many articles, especially those for lower grade levels, consist of only puzzles and quizzes, often in the form of simple multiple-choice questions. We discarded such texts and kept only 1433 full articles.

Entity-Density Features(1/2) • Conceptual information is often introduced in a text by entities, which consist of general nouns and named entities, e.g. people’s names, locations, organizations, etc. These are important in text comprehension, because established entities form basic components of concepts and propositions, on which higher level discourse processing is based. • We hypothesized that the number of entities introducedin a text relates to the working memoryburden on their targeted readers – individuals withintellectual disabilities. We defined entities as aunion of named entities and general nouns (nounsand proper nouns) contained in a text, with overlappinggeneral nouns removed.

Entity-Density Features(2/2) • We believe entity-density features may also relate to the readability of a text for a general audience.

Lexical Chain Features • During reading, a more challenging task with entities is not just to keep track of them, but to resolve the semantic relations among them, so that information can be processed, organized and stored in a structured way for comprehension and later retrieval. • The length of a chainis the number of entities contained in the chain,the span of chain is the distance between the indexof the first and last entity in a chain. A chain isdefined to be active for a word or an entity if thischain passes through its current location.

Coreference Inference Features • Relations among concepts and propositions are oftennot stated explicitly in a text. Automatically resolvingimplicit discourse relations is a hard problem. • The ability to resolve referential relations isimportant for text comprehension. • Inferencedistance is the difference between the index of thereferent and that of its pronominal reference. If thesame referent occurs more than once in a chain,the index of the closest occurrence is used whencomputing the inference distance.

Entity Grid Features • Coherent texts are easier to read. • Each text is abstracted into a gridthat captures the distribution of entity patterns atthe level of sentence-to-sentence transitions. Theentity grid is a two-dimensional array, with one dimensioncorresponding to the salient entities in thetext, and the other corresponding to each sentenceof the text.

Language Modeling Features • We use grade levels to divide the whole corpus into four smaller subsets. In addition to implementing Schwarm and Ostendorf’s information-gain approach, we also built LMs based on three other types of text sequences for comparison purposes. These included: word-token-only sequence (i.e., the original text), POS-only sequence, and paired word-POS sequence. For each grade level, we use the SRI Language Modeling Toolkit5 (with Good-Turing discounting and Katz backoff for smoothing) to train 5 language models (1- to 5-gram) using each of the four text sequences, resulting in 4×5×4 = 80 perplexity features for each text tested.

Parsed Syntactic Features • Our parsed syntactic features focus on clauses (SBAR), noun phrases (NP), verb phrases (VP) and prepositional phrases (PP).

POS-based Features • Part-of-speech-based grammatical features were shown to be useful in readability prediction. To extend prior work, we systematically studied a number of common categories of words and investigated to what extent they are related to a text’s complexity. We focus primarily on five classes of words (nouns, verbs, adjectives, adverbs, and prepositions) and two broad categories (content words, function words). Content words include nouns, verbs, numerals, adjectives, and adverbs; the remaining types are function words.

Shallow Features • Shallow features, which are limited to superficial text properties, are computationally much less expensive than syntactic or discourse features.

Other Features • For comparison, we replicated 6 out-of-vocabulary features described in Schwarm and Ostendorf (2005). • We also replicated the 12 perplexity features implemented by Schwarmand Ostendorf(2005).

Experiments and Discussion • In our research, we have used various models, including linear regression; standard classification (Logistic Regression and SVM), which assumes no relation between grade levels; and ordinal regression/ classification (provided by Weka, with Logistic Regression and SMO as base function), which assumes that the grade levels are ordered. • In this paper, we present the best results, which are obtained by standard classifiers. • We evaluate classification accuracy using repeated 10-fold cross-validation on the Weekly Reader corpus. Classification accuracy is defined as the percentage of texts predicted with correct grade levels. We repeat each experiment 10 times and report the mean accuracy and its standard deviation.

Discourse Features • We first discuss the improvement made by extendingour earlier entity-density features (Feng et al.,2009). • Withearlier features only, the model achieves 53.66%accuracy. With our new features added, the modelperformance is 59.63%. • Table 6 presents the classification accuracy of models trained with discourse features.

Language Modeling Features • Table 7 compares the performance of models generated using our approach and our replication of Schwarmand Ostendorf’s (2005) approach.

Parsed Syntactic Features(1/2) • Table 8 compares a classifier trained on the fourparse features of Schwarm and Ostendorf (2005) toa classifier trained on our expanded set of parse features.

Parsed Syntactic Features(2/2) • Table 9 shows adetailed comparison of particular parsed syntacticfeatures.

POS-based Features • The classification accuracy generated by modelstrained with various POS features is presentedin Table 10.

Shallow Features • We present some notable findings on shallow featuresin Table 11.

Comparison with Previous Studies • Using the same experiment design, we train classifiers with three combinations of our features as listed in Table 12.

Conclusions • Discourse features do not seem to be very useful in building an accurate readability metric. The reason could lie in the fact that the texts in the corpus we studied exhibit relatively low complexity, since they are aimed at primary-school students. In future work, we plan to investigate whether these discourse features exhibit different discriminative power for texts at higher grade levels.

A Comparison of Features for Automatic Readability Assessment