Unsupervised Clustering of People, Places & Organizations in U.S. Diplomatic Cables

Presentation Transcript


  1. Unsupervised Clustering of People, Places & Organizations in U.S. Diplomatic Cables • Xuwen Cao, Beyang Liu

  2. Process Outline • Identify entities in 3891 leaked U.S. diplomatic cables published by WikiLeaks • Extract features from a window around each entity • Sentiment scores • Co-occurring entities • Adjectives in some fixed-size window • Cluster entities in feature space
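The feature-extraction step in the outline above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `window_features`, the window size, and the `cooc=`/`adj=` feature-name prefixes are all assumptions. It assumes tokens have already been tagged, with entities carrying a Stanford-NER-style label and adjectives carrying the POS tag `JJ`.

```python
from collections import Counter

# Assumed label set: the three entity classes named in the title.
ENTITY_TAGS = {"PERSON", "LOCATION", "ORGANIZATION"}

def window_features(tokens, tags, entity_index, window=5):
    """Count co-occurring entities and adjectives near one entity mention.

    tokens and tags are parallel lists; window is the number of tokens
    inspected on each side of the mention (a hypothetical default).
    """
    lo = max(0, entity_index - window)
    hi = min(len(tokens), entity_index + window + 1)
    features = Counter()
    for i in range(lo, hi):
        if i == entity_index:
            continue  # skip the mention itself
        if tags[i] in ENTITY_TAGS:
            features["cooc=" + tokens[i].lower()] += 1
        elif tags[i] == "JJ":
            features["adj=" + tokens[i].lower()] += 1
    return features
```

Summing these counters over every mention of an entity would give one sparse count vector per entity, which is the kind of input a clustering step expects.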

  3. K-Means Clustering • Stanford NLP (NER + POS) • Extract locations (LOCATION & NN) • e.g. London, Africa, China, Caucasus • Sentiment analysis on JJ (SentiWordNet) • Calibrate using sentiment toward the U.S. • Frequency counting
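As a rough sketch of the clustering step on this slide, the code below runs plain k-means (Lloyd's algorithm) on points that are assumed to be (entity frequency, sentiment score) pairs, one per location. The function names, the random initialization, and the fixed iteration count are illustrative assumptions; the slides do not say how the authors configured k-means.

```python
import random

def _nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])),
    )

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; points is a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed init: k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [_nearest(p, centers) for p in points]
    return centers, labels
```

Because frequency and sentiment live on different scales, it would probably be sensible to normalize both features before clustering; otherwise the larger-valued feature dominates the distance.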

  4. K-means Results • Figure: entity frequency and sentiment score

  5. Multinomial Mixture Model • Model many features as (probabilistic) function of cluster assignment • Naïve Bayes independence assumption • Maximize expected log-likelihood objective with EM • Graphical model: (Cluster Label) → (Features)
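The model on this slide can be illustrated with a small EM loop for a mixture of multinomials. This is a sketch under stated assumptions, not the authors' implementation: each entity is a vector of feature counts, each cluster has a mixing weight and one multinomial over features, and the smoothing constant `alpha`, the random soft initialization, and the fixed iteration count are choices made here for illustration.

```python
import math
import random

def em_multinomial_mixture(counts, k, iterations=50, alpha=0.1, seed=0):
    """Fit a k-component multinomial mixture to rows of feature counts.

    Returns (pi, theta, resp): mixing weights, per-cluster feature
    distributions, and per-entity cluster responsibilities.
    """
    rng = random.Random(seed)
    n, v = len(counts), len(counts[0])
    # Assumed initialization: random soft assignments, normalized per entity.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(iterations):
        # M-step: re-estimate parameters from soft counts (alpha smooths zeros).
        pi = [(sum(resp[i][c] for i in range(n)) + alpha) / (n + k * alpha)
              for c in range(k)]
        theta = []
        for c in range(k):
            weighted = [alpha + sum(resp[i][c] * counts[i][f] for i in range(n))
                        for f in range(v)]
            total = sum(weighted)
            theta.append([w / total for w in weighted])
        # E-step: posterior over clusters, computed in log space for stability.
        for i in range(n):
            logp = [math.log(pi[c])
                    + sum(x * math.log(theta[c][f]) for f, x in enumerate(counts[i]))
                    for c in range(k)]
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            z = sum(unnorm)
            resp[i] = [u / z for u in unnorm]
    return pi, theta, resp
```

The per-feature sum inside the E-step is where the Naïve Bayes independence assumption shows up: given the cluster, each feature count contributes its own log-probability term. Results depend on the starting assignments, so rerunning with several seeds is a reasonable precaution.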

  6. EM Initialization Issues • Figure: histograms of cluster sizes (k = 100)

  7. Sample Clusters from Multinomial Mixture Model • Good examples: [cairo, iran, saudi arabia, west bank, palestinian authority, qatar, middle east, karachi, maliki]; [tripoli, dutch, france, abuja, muammar al-qadhafi, icc (international criminal court)] • Bad examples: [atmar, ben ali, saleh, european union, eu, icrc (red cross), wto, ahmadinejad]; [helmand, karzai, seoul, brown, williams, tadic] • Many other clusters are very small or heterogeneous • The model seems to be cueing off the co-occurrence features the most

  8. Future Direction • More advanced features, targeted toward sentiment • E.g. n-gram adjective phrases • Better model: mixture of CRF clustering, rather than Naïve Bayes
