22nd USENIX Security Symposium (USENIX Security '13) The Velocity of Censorship: High-Fidelity Detection of Microblog Post Deletions Tao Zhu¹; David Phipps²; Adam Pridgen³; Jedidiah R. Crandall⁴; Dan S. Wallach³ ¹Independent Researcher ²Bowdoin College ³Rice University ⁴University of New Mexico 左昌國 2013/09/10 Seminar @ ADLab, CSIE, NCU
Outline • Introduction • Methodology • Hypotheses • Topic Extraction • Discussion • Conclusion
Introduction • Microblogs in China: Weibo • Sina Weibo (http://weibo.com) • 503 million registered users (Dec. 2012) • 100 million messages sent daily • Promoting visibility of social issues • China employs both backbone-level filtering of IP packets and higher-level filtering implemented in software • Many works focus on how and what to filter • This paper focuses on how quickly microblog posts are removed
Introduction • Contributions: • The implementation of a method that detects a censorship event within 1-2 minutes of its occurrence • An understanding of how Weibo can react so quickly when deleting posts with sensitive content • 4 hypotheses • Overcoming the use of neologisms, named entities, and informal language in Chinese for topical analysis
Methodology • Identifying the sensitive user group • Crawling posts of sensitive user group • Detecting deletions
Methodology – Identifying the Sensitive User Group • Search using outdated sensitive keywords listed by China Digital Times (http://chinadigitaltimes.net/2013/06/two-years-of-sensitive-words-grass-mud-horse-list/) • Using keywords like “党产共” (共产党, “Communist Party”, with the characters reversed); April 2011 to October 2012 • Starting with 25 sensitive users (manually selected) • [Diagram: 25 sensitive users; > 5 deletions; > 5 reposts for each user]
Methodology – Identifying the Sensitive User Group • Sensitive group reaches 3567 users after 15 days • More than 4500 post deletions daily • 1500 “permission denied” posts • 12% of the total posts from the group were eventually deleted • This methodology cannot yield a representative sample of the whole of Weibo
Methodology – Crawling • User timeline: • The Weibo user timeline API returns the most recent 50 posts of the specified user • Querying 3567 sensitive users once per minute • 100 accounts for API calls • 300 concurrent Tor circuits • Four-node cluster running Hadoop and HBase
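To get a feel for the crawl load, the sketch below spreads the 3567 users over the 100 API accounts. The slides do not say how calls were distributed, so the round-robin policy and the function name `assign_round_robin` are assumptions made only for illustration.

```python
def assign_round_robin(user_ids, n_accounts):
    """Spread user IDs over API accounts in round-robin order.

    Assumption: an even split; the slides do not describe the real policy.
    """
    buckets = [[] for _ in range(n_accounts)]
    for i, user_id in enumerate(user_ids):
        buckets[i % n_accounts].append(user_id)
    return buckets


# 3567 users polled once per minute through 100 accounts
buckets = assign_round_robin(range(3567), 100)
calls_per_account_per_minute = [len(b) for b in buckets]
```

Under this assumed split, each account would issue 35 or 36 timeline calls per minute.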
Methodology – Detecting Deletions • If a post is in the database but is not returned by Weibo, issue a secondary query for that post to determine which error message is returned • Permission-denied deletion (system deletion) • “Permission denied” error • Caused by a censorship event • The post still exists but cannot be accessed by users • General deletion • “Post does not exist” error • May be caused by user self-deletion or a censorship event • The post does not exist
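The detection step above can be sketched as a set difference followed by a secondary lookup. This is a minimal illustration, not the authors' implementation: `lookup_error` is a hypothetical callable standing in for the secondary query, and the error strings it returns are assumed to match the wording on the slide.

```python
def detect_deletions(stored_ids, timeline_ids, lookup_error):
    """Classify posts that are stored locally but missing from the timeline.

    lookup_error(post_id) is a hypothetical stand-in for the secondary query:
    it returns an error message string, or None if the post is still readable.
    """
    results = {}
    for post_id in sorted(set(stored_ids) - set(timeline_ids)):
        error = lookup_error(post_id)
        if error is None:
            # Still accessible, e.g. it only fell out of the 50-post window
            continue
        message = error.lower()
        if "permission denied" in message:
            results[post_id] = "system deletion"
        elif "does not exist" in message:
            results[post_id] = "general deletion"
        else:
            results[post_id] = "unknown"
    return results


# Made-up example: posts 2 and 3 were deleted, post 4 is merely older
errors = {2: "Permission denied", 3: "Post does not exist"}
found = detect_deletions([1, 2, 3, 4], [1], errors.get)
```

The secondary query matters because the timeline returns only the 50 most recent posts, so absence from the timeline alone does not prove deletion.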
Methodology – Detecting Deletions • This paper focuses on system deletions • Apparently not by users • From July 2012 to September 2012, 2.38 million posts were collected, with a 12.8% total deletion rate (4.5% for system deletions and 8.3% for general deletions). • The lifetime of a post is the time difference between the time the system detected the post being deleted and the creation time. • The measurement fidelity is on the order of minutes
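The lifetime definition above amounts to a timestamp subtraction. The sketch below uses made-up timestamps and a hypothetical helper name, `post_lifetime_minutes`.

```python
from datetime import datetime


def post_lifetime_minutes(created_at, deletion_detected_at):
    """Lifetime = time the deletion was detected minus the creation time."""
    return (deletion_detected_at - created_at).total_seconds() / 60.0


# Made-up timestamps, for illustration only
created = datetime(2012, 7, 30, 10, 0, 0)
detected = datetime(2012, 7, 30, 10, 7, 30)
lifetime = post_lifetime_minutes(created, detected)
```

Because the deletion is only noticed at the next poll, a lifetime measured this way can exceed the true lifetime by up to roughly one polling interval, which is consistent with the minute-level fidelity stated on the slide.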
Hypotheses • How can the Weibo system find sensitive posts and remove them so quickly? • How are those sensitive posts located by the moderators after a month in the huge database? • Weibo has different strategies to target sensitive content
Hypotheses • Hypothesis 1: • Weibo has filtering mechanisms as a proactive, automated defense • Explicit filtering • Implicit filtering • “shishikanfalunhowle” • Camouflaged posts
Hypotheses • Hypothesis 2: • Weibo targets specific users, such as those who frequently post sensitive content
Hypotheses • Hypothesis 3: • When a sensitive post is found, a moderator will use automated searching tools to find all of its related reposts (parent, child, etc.), and delete them all at once
Hypotheses • Hypothesis 4: • Deletion speed is related to the topic. That is, particular topics are targeted for deletion based on how sensitive they are. • Five main topics: • Qidong • Qian Yunhui • Beijing Rainstorm • Diaoyu Island • Group Sex
Topic Extraction • Automatic methods are needed to classify the posts • TF*IDF (https://zh.wikipedia.org/wiki/TF-IDF) • Assign weights to the terms (n-grams) of a document • Pointillism approach [27] • Reconstruction from grams to words and phrases using external information
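As a rough illustration of TF*IDF weighting over n-grams, the sketch below scores character bigrams with a textbook formula (relative term frequency times log of inverse document frequency). The slides do not give the exact weighting used in the paper, so the formula, the bigram choice, and the sample strings are all assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_idf(documents, n=2):
    """Return one {gram: weight} dict per document.

    Assumed weighting: tf = count / total grams in the document,
    idf = log(number of documents / documents containing the gram).
    """
    grams_per_doc = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())
    n_docs = len(documents)
    weights = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        weights.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in grams.items()
        })
    return weights


# Made-up sample posts: "启东" occurs in two of three, so it weighs less
sample = ["启东抗议", "启东排污", "北京暴雨"]
weights = tf_idf(sample)
```

Grams that occur in every document receive weight zero, which is why very common terms drop out of the ranking.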
Topic Extraction • 李W阳 (Li Wangyang, from 李旺阳) • 六圌四 (June Fourth, from 六四) • 胡()涛 (Hu Jintao, from 胡锦涛) • 启-东, 启\东 and 启/东 (Qidong, from 启东)
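Some of the variants above differ from the original term only by inserted separators. The sketch below strips punctuation, symbols, and spaces using Unicode categories. It is a simplified stand-in, not the Pointillism approach: it collapses the Qidong variants but cannot restore a replaced or dropped character, as in 李W阳 or 胡()涛. The function name `strip_separators` is hypothetical.

```python
import unicodedata


def strip_separators(term):
    """Drop punctuation (P*), symbol (S*), and separator (Z*) characters."""
    return "".join(
        ch for ch in term
        if unicodedata.category(ch)[0] not in ("P", "S", "Z")
    )


# The three Qidong variants from the slide collapse to the same string
variants = ["启-东", "启\\东", "启/东"]
normalized = {strip_separators(v) for v in variants}
```

Substitution cases still need the kind of reconstruction from grams and external information described on the previous slide.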
Topic Extraction • Which topics among these have been discussed for the longest period of time? • Independent Component Analysis (ICA) • Beijing, government, China, country, policeman, and people • These 6 terms appear in almost every individual topic
Discussion – Filtering Mechanisms • Proactive mechanisms • Hypothesis 1 • Backwards repost search • Hypothesis 3: chained repost deletion • Backwards keyword search • Similar to hypothesis 3: related-keyword deletion • 兲朝 (a variant spelling of 天朝, “Celestial Dynasty”) • 37人 (“37 people”) (http://news.now.com/home/international/player?newsId=40857) • Monitoring specific users • Hypothesis 2
Discussion – Filtering Mechanisms • Account closures • 300 user accounts closed • Search filtering • Public timeline filtering • User credit points • Users can report sensitive or rumor-based posts to earn points
Conclusion • Deletions happen most heavily in the first hour • 90% of the deletions happen within the first 24 hours • The 4 hypotheses
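Statements such as "90% of the deletions happen within the first 24 hours" are cumulative fractions over post lifetimes. The sketch below shows how such a fraction could be computed. The sample lifetimes are invented and do not reproduce the paper's data.

```python
def fraction_deleted_within(lifetimes_hours, cutoff_hours):
    """Share of deleted posts whose lifetime is at most cutoff_hours."""
    if not lifetimes_hours:
        return 0.0
    within = sum(1 for t in lifetimes_hours if t <= cutoff_hours)
    return within / len(lifetimes_hours)


# Invented lifetimes (in hours) of four deleted posts
sample_lifetimes = [0.1, 0.5, 2.0, 30.0]
first_hour = fraction_deleted_within(sample_lifetimes, 1)
first_day = fraction_deleted_within(sample_lifetimes, 24)
```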