1 / 11

Improving Gender Classification of Blog Authors

Improving Gender Classification of Blog Authors. Arjun Mukherjee Bing LIu UIC. Introduction. Dataset 3100 blogs ( 1588 – men,1512 – women ) Related work

Télécharger la présentation

Improving Gender Classification of Blog Authors

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving Gender Classification of Blog Authors Arjun Mukherjee Bing LIu UIC

  2. Introduction • Dataset • 3100 blogs ( 1588 – men,1512 – women ) • Related work • Current systems use POS n-grams, word classes, personality types to capture stylistic behavior of authors’ writings for classifying gender. • However, these works use only one or a subset of the classes of features. None of them uses all features for classification learning.

  3. Feature Engineering and mining • F-Measure ( Not the accuracy!!! ) • F-Measure is used to measure contextuality and formality • F = 0.5 * [(freq.noun + freq.adj + freq.prep + freq.art) – (freq.pron + freq.verb + freq.adv + freq.int) + 100] • Stylistic features • Words such as lol, hmm and smileys determine style of person

  4. Feature….(contd) • Gender Preferential features • Females tend to post emotionally. Frequent use of intensive adverbs ( terribly, awfully….) whereas men tend to more provocative

  5. Features…( cont. ) • Factor analysis and word classes • Factor or word factor analysis refers to the process of finding groups of similar words that tend to occur in similar documents. • POS sequence pattern • A POS sequence pattern is a sequence of consecutive POS tags that satisfy some constraints • POS ngrams are good at capturing the heavy stylistic and syntactic information. Instead of using all such n-grams, we want to discover all those patterns that represent true regularities, and we also want to have flexible lengths

  6. Features…( contd.. ) • POS Sequence patterns • mining algorithm mines all such patterns that satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds or constraints.

  7. POS sequence algo.

  8. Feature Selection • System uses EFS( Ensemble feature selection ) algorithm for selecting the fearutes. • EFS is a hybrid of filter and wrapper techniques • Some of the criteria used for feature selection are information gain, Mutual information, Chi square test • Feature value assignment • The values are either boolean or term frequency

  9. Experiments and results • Classifiers • Naïve Bayes • SVM • SVM Regression

  10. Results

  11. Comparison results

More Related