Improving Gender Classification of Blog Authors

Improving Gender Classification of Blog Authors • Arjun Mukherjee, Bing Liu • UIC

Introduction • Dataset • 3100 blogs (1588 by men, 1512 by women) • Related work • Current systems use POS n-grams, word classes, and personality types to capture the stylistic behavior of authors' writing for gender classification. • However, each of these works uses only one class of features or a subset of the classes. None of them uses all feature classes together for classification learning.

Feature engineering and mining • F-measure (not a measure of classification accuracy!) • The F-measure quantifies the contextuality and formality of a text • F = 0.5 * [(freq.noun + freq.adj + freq.prep + freq.art) – (freq.pron + freq.verb + freq.adv + freq.int) + 100] • Stylistic features • Words such as lol and hmm, as well as smileys, indicate a person's writing style
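The F-measure formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each freq term is the percentage of all words carrying that part of speech (which the +100 offset suggests), and the coarse tag names (`noun`, `adj`, and so on) and the function name `f_measure` are hypothetical.

```python
from collections import Counter

# Coarse POS categories from the formula; the tag names are illustrative assumptions.
FORMAL = ("noun", "adj", "prep", "art")
CONTEXTUAL = ("pron", "verb", "adv", "int")


def f_measure(pos_tags):
    """Return the formality F-measure for a sequence of coarse POS tags.

    Each frequency is taken as the percentage of all tokens with that tag
    (an assumption; the slide does not state the unit).
    """
    if not pos_tags:
        raise ValueError("pos_tags must not be empty")
    counts = Counter(pos_tags)
    total = len(pos_tags)

    def freq(tag):
        return 100.0 * counts[tag] / total

    formal = sum(freq(t) for t in FORMAL)
    contextual = sum(freq(t) for t in CONTEXTUAL)
    return 0.5 * (formal - contextual + 100.0)


# Hypothetical tagged sentence: 6 formal tags and 4 contextual tags out of 10.
tags = ["art", "noun", "verb", "prep", "art", "adj", "noun", "adv", "pron", "verb"]
score = f_measure(tags)
```

Under these assumptions the score ranges from 0 (only contextual categories) to 100 (only formal categories), so a higher value indicates a more formal text.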

Features (contd.) • Gender preferential features • Women tend to post more emotionally, with frequent use of intensive adverbs (terribly, awfully, …), whereas men tend to be more provocative

Features (contd.) • Factor analysis and word classes • Factor analysis (or word factor analysis) is the process of finding groups of similar words that tend to occur in similar documents. • POS sequence patterns • A POS sequence pattern is a sequence of consecutive POS tags that satisfies some constraints. • POS n-grams are good at capturing stylistic and syntactic information. Instead of using all such n-grams, we want to discover all those patterns that represent true regularities, and we also want them to have flexible lengths.
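For contrast with flexible-length patterns, fixed-length POS n-grams can be extracted as follows. This is a small sketch; the Penn Treebank style tags and the helper name `pos_ngrams` are assumptions for illustration only.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams (as tuples) over a POS tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hypothetical tag sequence for a sentence like "the dog chases the cat".
tags = ["DT", "NN", "VBZ", "DT", "NN"]
bigram_counts = Counter(pos_ngrams(tags, 2))
```

Every n-gram of the chosen length becomes a feature here, whether or not it reflects a real regularity, which is the limitation the sequence patterns are meant to address.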

Features (contd.) • POS sequence patterns • The mining algorithm finds all patterns that satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds.

POS sequence algo.
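A level-wise mining procedure with minsup and minadherence thresholds might look like the sketch below. It is not the algorithm from the slide: support is assumed to be the fraction of documents containing the pattern, and adherence is assumed to be a fairSCP-style ratio (squared pattern probability over the average product of prefix and suffix probabilities). All function names are hypothetical.

```python
def contains(doc, pattern):
    """True if the pattern occurs as consecutive tags in the document."""
    n = len(pattern)
    return any(tuple(doc[i:i + n]) == pattern for i in range(len(doc) - n + 1))


def support(docs, pattern):
    """Fraction of documents that contain the pattern (assumed definition)."""
    return sum(contains(d, pattern) for d in docs) / len(docs)


def adherence(docs, pattern):
    """fairSCP-style score (an assumption; the slide does not define adherence)."""
    n = len(pattern)
    if n < 2:
        return 1.0
    p = support(docs, pattern)
    # Average over every way of splitting the pattern into prefix and suffix.
    denom = sum(
        support(docs, pattern[:i]) * support(docs, pattern[i:]) for i in range(1, n)
    ) / (n - 1)
    return p * p / denom if denom else 0.0


def mine_pos_patterns(docs, minsup, minadherence, max_len=5):
    """Return patterns of length >= 2 meeting both thresholds."""
    tags = sorted({t for d in docs for t in d})
    frequent_tags = [t for t in tags if support(docs, (t,)) >= minsup]
    frequent = [(t,) for t in frequent_tags]
    result = []
    while frequent and len(frequent[0]) < max_len:
        # Grow only from frequent patterns: support can never increase when a
        # pattern is extended, whereas adherence has no such guarantee.
        candidates = [p + (t,) for p in frequent for t in frequent_tags]
        frequent = [c for c in candidates if support(docs, c) >= minsup]
        result.extend(c for c in frequent if adherence(docs, c) >= minadherence)
    return result


# Hypothetical corpus of four POS-tagged documents.
docs = [
    ["DT", "NN", "VBZ"],
    ["DT", "NN", "VBD"],
    ["PRP", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN"],
]
patterns = mine_pos_patterns(docs, minsup=0.5, minadherence=0.5)
```

Candidates are pruned on support alone and filtered on adherence only at output time, so a pattern whose prefix has low adherence can still be found.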

Feature selection • The system uses the EFS (Ensemble Feature Selection) algorithm to select features. • EFS is a hybrid of filter and wrapper techniques • The selection criteria include information gain, mutual information, and the chi-square test • Feature value assignment • Feature values are either Boolean or term frequencies
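Two of the filter criteria named above, information gain and the chi-square statistic, can be computed for a Boolean feature from a 2x2 contingency table. This is a sketch of the scoring step only and does not reproduce EFS or its wrapper stage; the table layout and function names are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(table):
    """Information gain of a Boolean feature about a binary class.

    table = [[a, b], [c, d]]: rows are feature present/absent,
    columns are the two classes (an assumed layout).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_cond = sum(sum(row) / n * entropy(row) for row in table if sum(row))
    return h_class - h_cond


def chi_square(table):
    """Pearson chi-square statistic for the same 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical counts: a feature that perfectly separates the classes,
# and one that is independent of the class.
perfect = [[10, 0], [0, 10]]
useless = [[5, 5], [5, 5]]
scores = {
    "perfect": (information_gain(perfect), chi_square(perfect)),
    "useless": (information_gain(useless), chi_square(useless)),
}
```

A filter stage would rank features by such scores and keep the top ones; how EFS combines the rankings with a wrapper search is not specified in these slides.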

Experiments and results • Classifiers • Naïve Bayes • SVM • SVM Regression

Results

Comparison results
