
Gender Classification of Japanese Authors

Gender Classification of Japanese Authors. David Edwards & Cybelle Smith. Gendered Speech in Japanese. Gender of speaker may be overtly marked: gender-specific first-person pronouns 僕 (boku, male); 俺 (ore, male); 私 (watashi, female or neutral).



Presentation Transcript


  1. Gender Classification of Japanese Authors
     David Edwards & Cybelle Smith

  2. Gendered Speech in Japanese
     Gender of speaker may be overtly marked:
         Gender-specific first-person pronouns: 僕 (boku, male); 俺 (ore, male); 私 (watashi, female or neutral)
     Question: Does gender have less-overt effects on Japanese texts as well?
         Can word choice, morphology, or writing style indicate gender, even in noisy environments like fiction writing?

  3. Corpora
     “Peace” Corpus
         29 personal essays by middle school students
         Topic: “Peace”
         29 authors: 22 female, 7 male
     “Bookstudio” Corpus
         485 installments of online novels
         Genre: Fantasy
         40 authors: 20 female, 20 male
         Also collected ~181 installments from authors of unknown gender (for future research)

  4. Our Baseline - The “Boku” Test
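
The transcript preserves only the title of this slide, so the exact rule behind the baseline is not stated. A minimal sketch of one plausible reading, assuming the test guesses "male" whenever a male-marked first-person pronoun (僕 or 俺, as listed on the earlier slide) appears and "female" otherwise; the function name `boku_test` and the sample sentences are hypothetical:

```python
# Hypothetical baseline: the slide names the "Boku" test but does not define it.
# Assumption: a text is labeled male if it contains a male-marked pronoun.
MALE_PRONOUNS = ("僕", "俺")  # boku, ore


def boku_test(text: str) -> str:
    """Return 'male' if any male-marked first-person pronoun occurs, else 'female'."""
    return "male" if any(p in text for p in MALE_PRONOUNS) else "female"


# Invented example sentences, not taken from either corpus.
print(boku_test("僕は平和について考えた。"))
print(boku_test("私は平和について考えた。"))
```

Because 私 is female or neutral, a rule like this can only err in one direction: male authors who write 私 are labeled female.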

  5. Classifiers Used
     Naïve Bayes:
         Build conditional probabilities of features given gender
         Calculate probability of test data given a particular gender
         Select highest-probability gender
     SVM:
         Used the LIBSVM free classifying tool
         Find dividing hyperplane in num-feature dimensional space
             - Requires problem-specific parameters chosen via cross-validation
         Apply hyperplane to test data
     Also attempted Logistic Regression
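
The three Naïve Bayes steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a multinomial model with add-one (Laplace) smoothing and log probabilities, neither of which the slides specify, and the feature names `f1` to `f3` and the training pairs are invented placeholders.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (features, gender) pairs, where features is a list of strings."""
    counts = defaultdict(Counter)  # gender -> feature counts
    n_docs = Counter()             # gender -> number of training documents
    vocab = set()
    for features, gender in docs:
        counts[gender].update(features)
        n_docs[gender] += 1
        vocab.update(features)
    return counts, n_docs, vocab


def classify(model, features):
    """Return the gender with the highest log-probability for the given features."""
    counts, n_docs, vocab = model
    total_docs = sum(n_docs.values())
    best, best_score = None, -math.inf
    for gender in n_docs:
        total = sum(counts[gender].values())
        score = math.log(n_docs[gender] / total_docs)  # class prior
        for f in features:
            if f not in vocab:
                continue  # ignore features never seen in training
            # Add-one smoothing (an assumption; not stated on the slide).
            score += math.log((counts[gender][f] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = gender, score
    return best


# Placeholder training data, invented for illustration only.
docs = [
    (["f1", "f1", "f2"], "female"),
    (["f1", "f3"], "female"),
    (["f2", "f2", "f3"], "male"),
    (["f2", "f3"], "male"),
]
model = train(docs)
print(classify(model, ["f1", "f1"]))
print(classify(model, ["f2", "f2", "f3"]))
```

Working in log space avoids underflow when a long text contributes thousands of feature occurrences.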

  6. Chasen: Segmenter and POS-tagger
     Stem | Pronunciation | Lemma | Part of Speech
     (space) | | | 記号-空白
     光 | ヒカリ | 光 | 名詞-一般
     が | ガ | が | 助詞-格助詞-一般
     彷徨 | ホウコウ | 彷徨 | 名詞-サ変接続
     う | ウ | うい | 形容詞-自立 形容詞・アウオ段 ガル接続
     よう | ヨウ | よう | 名詞-接尾-一般
     な | ナ | だ | 助動詞 特殊・ダ 体言接続
     暗き | クラキ | 暗い | 形容詞-自立
     闇 | ヤミ | 闇 | 名詞-一般
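
Output like the above can be turned into per-token records before feature extraction. A minimal sketch, assuming the tagger output has been saved as tab-separated lines with sentences terminated by `EOS`; the `Token` type and `parse_chasen` function are hypothetical names, and the sample string reuses four rows from the slide:

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    stem: str   # surface form as written in the text
    pron: str   # pronunciation in katakana
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag


def parse_chasen(output: str) -> List[Token]:
    """Parse tab-separated tagger output into Token records.

    Assumes one token per line and 'EOS' as the sentence terminator;
    any columns after the fourth (conjugation details) are ignored.
    """
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines rather than guess
        tokens.append(Token(*fields[:4]))
    return tokens


sample = (
    "光\tヒカリ\t光\t名詞-一般\n"
    "が\tガ\tが\t助詞-格助詞-一般\n"
    "暗き\tクラキ\t暗い\t形容詞-自立\n"
    "闇\tヤミ\t闇\t名詞-一般\n"
    "EOS\n"
)
tokens = parse_chasen(sample)
print([t.lemma for t in tokens])
```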

  7. Features
     Stem | Pron | Lemma | POS
     暗き | クラキ | 暗い | 形容詞-自立
     KURAki | kuraki | KURAi | adjective - independent

  8. Features
     私 | わたし | ワタシ
     Kanji (Chinese character) | Hiragana (phonetic) | Katakana (phonetic, like italics)
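
A script-based feature of this kind can be computed from Unicode code points. A minimal sketch, assuming the Hiragana block (U+3040 to U+309F), the Katakana block (U+30A0 to U+30FF), and the main CJK Unified Ideographs block (U+4E00 to U+9FFF) are sufficient; the function names and the "mixed" and "other" labels are illustrative choices, not the authors' feature definition:

```python
def char_script(ch: str) -> str:
    """Classify one character by Unicode block."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def word_shape(word: str) -> str:
    """Return the single script a word is written in, or 'mixed'."""
    if not word:
        return "other"
    scripts = {char_script(c) for c in word}
    return scripts.pop() if len(scripts) == 1 else "mixed"


# The three spellings of "watashi" from the slide, plus a kanji+hiragana word.
for w in ["私", "わたし", "ワタシ", "暗き"]:
    print(w, word_shape(w))
```

The three spellings of watashi share one pronunciation and one meaning, so a feature like this captures only how the author chose to write the word.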

  9. Single-feature performance on Naïve Bayes:
     Multi-feature performance on Naïve Bayes:

  10. SVM Performance
     • Optimizations:
         • Scaling counts to avoid swamping low-frequency features
         • Selecting optimal error rate and kernel parameters
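
Both optimizations can be illustrated without the SVM itself. A minimal sketch, assuming linear per-feature scaling to [0, 1] and an RBF kernel searched over the exponentially spaced cost and gamma values suggested in the LIBSVM practical guide; the slides do not say which scaling range, kernel, or grid the authors used, and the counts below are invented:

```python
def fit_scaler(rows):
    """Compute per-feature (min, max) from training rows of equal length."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, ranges):
    """Map each value linearly into [0, 1]; constant features map to 0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, (lo, hi) in zip(row, ranges)
    ]


# Invented raw counts: a high-frequency feature next to a low-frequency one.
train_rows = [[120, 2], [80, 0], [200, 1]]
ranges = fit_scaler(train_rows)
scaled = [scale(r, ranges) for r in train_rows]
print(scaled)

# Candidate parameters to try under cross-validation (assumed RBF kernel):
# cost C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3.
C_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]
print(len(C_grid) * len(gamma_grid), "parameter pairs")
```

The ranges are fitted on training data only and then reused for test data, so that the test set does not influence the scaling.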

  11. Conclusion
     • Without considering gendered pronouns, we achieved similar performance
     • Most-indicative feature: wordshape (use of kanji vs. hiragana vs. katakana, etc.), especially where multiple options exist
     • Point of interest: male and female Japanese authors differ not just in the words they use, but in how they choose to write those words
