STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION

S. Sameen Fatima
Dept. of Computer Science & Engineering
Osmania University, Hyderabad
(sameenf@hotmail.com)

BACKGROUND

Q. What is classification (of text)?
A. Classification is an important IR task in which one or more category labels are assigned to a document.

Approaches to Classification (of text)

Earlier approaches to text classification assigned labels to documents based on CONTENTS.

1. Word-based techniques
   - statistical (tf, idf)
   - term/keyword searches
   Advantage: Simple and can be automated
   Disadvantage: Phrases cannot be extracted

2. Phrase-based techniques
   a) In-depth NLP: Here we aspire to represent all the information in a text using context.
      - syntax
      - semantics
      - statistics
      Advantage: General, task-independent representation
      Disadvantage: Costly; not possible in polynomial time
   b) Information Extraction: Here we delimit in advance, as part of the specification of a task, the semantic range of the output, the relations we will represent, and the allowable fillers in each slot.
      Advantage: Works well for a specific corpus
      Disadvantage: A new IE system must be designed for each new corpus
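The statistical word-based technique named above (tf, idf) can be sketched roughly as follows. The slides do not give a formula, so the weighting used here (relative term frequency times log(N/df)) and the function name `tf_idf` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / document length) * log(N / df).
    """
    # Naive whitespace tokenisation, purely for illustration.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight zero, which is why such schemes capture content words rather than style.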

Limitations of the Earlier Approaches to Text Classification

Besides content, texts have STYLE, which the earlier approaches do not account for. The focus of this talk is to present STYLE as a new basis for text classification.

COMPUTATIONAL STYLISTICS

The study of style, in other words the detection of patterns common to a piece of writing, is known as STYLISTICS. When stylistic analysis uses computer-aided and statistical methods to analyse texts, the field of study is called COMPUTATIONAL STYLISTICS.

Related Work in Computational Stylistics

1. Pre-WWW Era:
- Author Attribution Studies: Mosteller and Wallace's well-known study of the anonymous essays published in THE FEDERALIST, which aimed to identify their authors (Hamilton and Madison). Stylistic parameters: sentence length, content words (nouns, adjectives, verbs), function words (prepositions, conjunctions), use of "by", "from", and "to", ... The interesting result was that content words were too subject-dependent to be good discriminators, while function words were good discriminators.
- Automatic Abstracting: Borko and Chatman advanced the view that it seems possible to make stylistic distinctions between informative abstracts (which discuss the research) and indicative abstracts (which discuss the article that describes the research), based on the form, voice, tense, and focus of the abstract.
- Teaching writing styles for different types of documents: the Writer's Workbench program on AT&T Unix.

Related Work in Computational Stylistics (contd.)

2. WWW Era (ongoing):
- Stylistic variation between the different genres found in the Wall Street Journal (Jussi Karlgren, Troy Straszheim). Example genres: Articles, Business News with tables, Business News, Lists of briefs, Editorials, Letters, Briefs, "What's New", Tables. Uses simple stylistic parameters: characters/word, digits/keywords, words/sentence.
- Establishing a genre palette for internet material (Jussi Karlgren, Johan Dewe, Ivan Bretan).

Definition of a Genre/Functional Style

A genre is a set of documents with a perceived consistent tendency to make the same stylistic choices; if it also has an established communicative function, it is a functional style. Genres can differ in usefulness.

Genres in my work (Corpus)
- Editorials from Hindu
- Editorials from Hindustan Times
- Editorials from Times of India

Hypothesis

Editorials from each newspaper show a systematic and consistent difference in the choice of presentation style, specifically to establish some intended communicative function (aggressive, conservative, liberal).

Aim of the Experiment

To find a descriptive and predictive algorithm for classifying editorials from different newspapers based on stylistic features.

Mathematical Model

Two models were explored to find which was applicable:
1. Vector Space Model, used by Salton in the SMART system (IRS)
2. Euclidean Space Model

Euclidean Space Model

An n-dimensional Euclidean space, $E^n$, is defined as the set of all n-tuples of real numbers $(x_1, x_2, \dots, x_n)$, where the Euclidean distance in $E^n$ between two points $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ is defined by

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

In our project, the Euclidean space represents a Stylistic Space.
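As a minimal sketch, the Euclidean distance defined above can be computed as follows. The function name `euclidean_distance` is illustrative and does not come from the slides; each point is assumed to be a sequence of numeric feature values.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points of the same dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For example, `euclidean_distance((0, 0), (3, 4))` evaluates to `5.0`.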

In the Vector Space Model, the distance between two points x and y is related to the angle (x, y) formed by the lines from each of the points to the origin, whose cosine is given by

$$\cos(x, y) = \frac{x \cdot y}{(x \cdot x)^{0.5} \, (y \cdot y)^{0.5}}$$

This model failed in stylistic analysis.
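The cosine measure above can be sketched as follows; the function name `cosine_similarity` is illustrative. The slides do not say why the model failed for stylistic analysis, so the usage note below only points out a general property of the measure and is not offered as the author's explanation.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y, measured at the origin."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

The measure depends only on direction: `(1, 2, 3)` and `(2, 4, 6)` have a cosine of 1 (up to rounding) even though their Euclidean distance is non-zero.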

Stylistic Profiling

A method of identifying the stylistic features in the writing style of an individual or a group of people and presenting them in a systematic way.

1. Lexical Features
• Percentage of interrogative pronouns
• Percentage of emphatic pronouns
• Percentage of prepositions
• Percentage of conjunctions
• Percentage of articles
• Percentage of action words
• Percentage of unique words

2. Structural Features
• Average words/sentence
• Maximum sentence length
• Total no. of sentences
• Total no. of words
• Total no. of characters

3. Affective Features
• Percentage of passive sentences
• Flesch Reading Ease
• Coleman-Liau Grade Level
• Bormuth Grade Level
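A few of the features listed above can be computed with a short sketch like the one below. It is not the author's extraction tool: the sentence and word splitting rules, the syllable heuristic, and the names `stylistic_profile` and `count_syllables` are all assumptions made for illustration. The Flesch Reading Ease score uses the standard formula 206.835 - 1.015 (words/sentences) - 84.6 (syllables/words).

```python
import re

ARTICLES = {"a", "an", "the"}
WORD_RE = re.compile(r"[A-Za-z']+")


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per vowel group."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def stylistic_profile(text):
    """Return a dict with a subset of lexical, structural and affective features."""
    # Assumed rule: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    lowered = [w.lower() for w in words]
    n_words = len(words)
    n_sentences = len(sentences)
    syllables = sum(count_syllables(w) for w in words)
    return {
        # Lexical features
        "pct_articles": 100.0 * sum(w in ARTICLES for w in lowered) / n_words,
        "pct_unique_words": 100.0 * len(set(lowered)) / n_words,
        # Structural features
        "avg_words_per_sentence": n_words / n_sentences,
        "max_sentence_length": max(len(WORD_RE.findall(s)) for s in sentences),
        "total_sentences": n_sentences,
        "total_words": n_words,
        "total_characters": len(text),
        # Affective feature
        "flesch_reading_ease": 206.835
        - 1.015 * (n_words / n_sentences)
        - 84.6 * (syllables / n_words),
    }
```

Features such as the percentage of passive sentences would need part-of-speech or parse information and are left out of this sketch.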

Classification Algorithm

1. Training Phase
- Training set consisting of 30 editorials each from H, HT, TI
- Feature Extraction (Lexical, Structural, Affective): 90 SPs
- Conduct ANOVA test & extract the SIGNIFICANT FEATURES: 90 FSPs
- Compute the mean for each of the significant features for each newspaper: 3 Prototypes (P-H, P-HT, P-TI)

2. Classification Phase
- New instance of editorial
- Significant Feature Extraction: FSP, I
- Compute the distance between I and each of the prototypes from the training phase
- Least d(I, P-H): classify as Hindu
- Least d(I, P-HT): classify as HT
- Least d(I, P-TI): classify as TI
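The two phases can be sketched as a nearest-prototype classifier. This is a minimal illustration under stated assumptions: profiles are already reduced to the significant features, distance is Euclidean as in the model described earlier, and the function names and toy numbers are hypothetical rather than taken from the experiment.

```python
import math


def build_prototypes(training_profiles):
    """Training phase: map each label to the per-feature mean of its profiles.

    training_profiles: {label: [feature_vector, ...]}
    """
    prototypes = {}
    for label, vectors in training_profiles.items():
        n = len(vectors)
        # zip(*vectors) iterates over features (columns) across all profiles.
        prototypes[label] = [sum(column) / n for column in zip(*vectors)]
    return prototypes


def classify(instance, prototypes):
    """Classification phase: return the label of the nearest prototype."""
    return min(prototypes, key=lambda label: math.dist(instance, prototypes[label]))


# Toy data, invented for illustration only (two features, two profiles per paper).
training = {
    "H": [[20.0, 5.0], [22.0, 7.0]],
    "HT": [[10.0, 1.0], [12.0, 3.0]],
    "TI": [[30.0, 9.0], [32.0, 11.0]],
}
prototypes = build_prototypes(training)
label = classify([12.5, 2.5], prototypes)  # nearest prototype is P-HT
```

`math.dist` requires Python 3.8 or later. In the experiment each newspaper would contribute 30 profiles rather than two.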

Results

1. Data Collection (SP)

2. Results of identifying significant features in the training phase (FSP):
A one-tailed ANOVA test was carried out.
- Null hypothesis: No difference between the means
- Alternate hypothesis: Means are different
- The ratio of the variance estimates is calculated: $F = S_b^2 / S_w^2$
- $S_b^2 = S_w^2$ (check for null hypothesis)
- $S_b^2 > S_w^2$ (check for alternate hypothesis)
- If $F > F_{crit}$ for a particular significance level, we say that the means of the feature are significantly different.

3. Results of the classification phase
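The F ratio described above can be sketched as a one-way ANOVA computed per feature. The sketch assumes that $S_b^2$ is the between-group variance estimate and $S_w^2$ the within-group variance estimate, which is the standard reading of that notation. The function name `anova_f` is illustrative.

```python
def anova_f(groups):
    """One-way ANOVA F ratio for one feature.

    groups: one list of feature values per newspaper.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group estimate: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group estimate: spread of values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    s_b2 = ss_between / (k - 1)
    s_w2 = ss_within / (n_total - k)
    if s_w2 == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    return s_b2 / s_w2
```

The result is then compared with the critical value for (k - 1, N - k) degrees of freedom at the chosen significance level, taken from an F table or a statistics library.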

Performance Evaluation

The following measures were computed:
- Precision = Number-classified-correctly / Number-total-classified
- Recall = Number-classified-correctly / Number-relevant-for-classification

Conclusion

The results of the experiment were positive. It was possible to classify editorials with a good degree of recall and precision.
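The two measures can be sketched per newspaper as follows. Computing them class by class is an assumption about how the definitions above were applied; the function name `precision_recall` and the labels in the usage note are illustrative.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one target class.

    precision = correctly classified as target / all classified as target
    recall    = correctly classified as target / all that truly are target
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    classified = sum(p == target for p in predicted_labels)
    relevant = sum(t == target for t in true_labels)
    correct = sum(
        t == target and p == target
        for t, p in zip(true_labels, predicted_labels)
    )
    # Report 0.0 rather than dividing by zero when a class is never seen.
    precision = correct / classified if classified else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```

With true labels `["H", "H", "HT", "TI", "TI"]` and predictions `["H", "HT", "HT", "TI", "H"]`, the call for target `"HT"` returns `(0.5, 1.0)`.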

Scope for further work

Currently, it is not clear whether topic and style are two independent dimensions of variation in text or whether they go hand in hand. This can be explored further by subclassifying editorials based on topic and then studying each subclass for stylistic variation.

Applications
- Classifying documents on the Internet based on GENRE
- Relating the FSPs of editorials to the reader profiles of each newspaper, so as to establish any interesting relationship
