This paper addresses the challenges of online social network profile linkage, focusing on user identification across multiple platforms while considering privacy and security concerns. We define the problem of linking profiles from different sites by utilizing user footprints and propose a probabilistic model that leverages unique features for improved accuracy. Our experimental results demonstrate a significant enhancement in identity accuracy by applying iterative inference and prioritizing distinctive attributes, which effectively supports both commercial and governmental applications. Future work will explore further improvements and broader applications.
Large-Scale Cost-sensitive Online Social Network Profile Linkage
Background & Motivation • Footprints across different social networks. • User identification in social analysis. • Privacy & security • Commercial & government applications
Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work
Problem Definition • Terminology • Identity: Person • Profile/User: Your footprint on social media • Profile Linkage: Link your footprints together • Input & Output • Input: profiles of one site as QUERY and profiles of the other site as TARGET. • Output: all profile pairs classified as matched.
Characteristics of profiles • Name (structured vs. semi-structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free text description • Bio, About me, Tags • Multilingualism
Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users register under different locales • Transliteration • 昊辰 => Haochen
Feature Acquisition • Network communication costs too much time. • Usage limits of web services. • 1000 invocations per day for Google Maps API • High computational complexity compared with string similarity. • Image processing algorithms.
Local Features • Username • Jaro-Winkler similarity • Language • Jaccard similarity • Description, URL • Cosine similarity with TF×IDF • Popularity • Defined as the number of friends of a user. • Adopt the following metric
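A minimal sketch of two of the local features listed above: Jaccard similarity for the language sets and cosine similarity over TF×IDF vectors for descriptions. The function names, the smoothed IDF formula, and the handling of empty inputs are assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets carry no evidence
    return len(a & b) / len(a | b)


def idf_table(documents):
    """Smoothed inverse document frequency for every token in the corpus.

    `documents` is a list of token lists. The smoothing used here is one
    common variant, chosen for illustration.
    """
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity of two token lists weighted by TF x IDF."""
    vec_a = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vec_b = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a missing description yields zero similarity
    return dot / (norm_a * norm_b)
```

In practice the IDF table would be built once over all profile descriptions and reused for every QUERY/TARGET pair.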
External Features • Geographic Location • Values are diverse and of different types. • Google Maps API: • string-represented location => geographic information • Spherical distance between two locations as the feature • Avatar • χ² dissimilarity of the avatars’ gray-scale histograms.
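The two external features can be sketched as follows, assuming the locations have already been geocoded to latitude/longitude and the avatars already reduced to gray-scale histograms of equal length. The haversine formula and the symmetric, normalized form of the χ² measure are common choices assumed here; the slides do not state which variants were used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-square dissimilarity of two histograms, in [0, 1].

    Both histograms are normalized to sum to 1 first, so avatars of
    different sizes remain comparable.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    result = 0.0
    for x, y in zip(hist_a, hist_b):
        p, q = x / total_a, y / total_b
        if p + q > 0:
            result += (p - q) ** 2 / (p + q)
    return 0.5 * result
```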
Classification: learning • Probabilistic model derived from naïve Bayes • Feature independence assumption
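Under the independence assumption, a naïve Bayes style score for a profile pair is the prior log-odds of a match plus one log-likelihood ratio per feature. The sketch below shows that textbook form only; the exact model derived in this work may differ, and the function names and the way likelihoods are passed in are hypothetical.

```python
import math


def match_log_odds(prior_match, likelihoods):
    """Log-odds that a profile pair is a match under feature independence.

    `prior_match` is P(match) in (0, 1). `likelihoods` is a list of
    (P(feature value | match), P(feature value | non-match)) tuples,
    one per observed feature. Both entries must be positive, so the
    estimates are assumed to be smoothed beforehand.
    """
    score = math.log(prior_match / (1.0 - prior_match))
    for p_match, p_non_match in likelihoods:
        # Independence lets each feature contribute one additive term.
        score += math.log(p_match / p_non_match)
    return score


def match_probability(prior_match, likelihoods):
    """Posterior probability of a match, obtained from the log-odds."""
    return 1.0 / (1.0 + math.exp(-match_log_odds(prior_match, likelihoods)))
```

For example, with an even prior and a single feature that is four times as likely under a match as under a non-match, `match_probability(0.5, [(0.8, 0.2)])` comes out at about 0.8.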
Classification: learning • Iterative inference • Terminate if S_n is discriminative. • Set the threshold from each feature's error rate on the training set to determine whether S_n is discriminative • Order of the features
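The iterative inference loop can be sketched as sequential feature acquisition with early termination: features are consumed in a fixed order, and acquisition stops as soon as the running score S_n is discriminative. The single pair of thresholds `lower` and `upper` is a simplifying assumption; the slides derive thresholds from per-feature training error rates, which is not reproduced here. All names are hypothetical.

```python
def iterative_inference(initial_score, features, lower, upper):
    """Accumulate evidence feature by feature, stopping early when possible.

    `features` is an ordered list of (name, acquire) pairs. `acquire` is a
    zero-argument callable returning that feature's additive contribution
    (for example a log-likelihood ratio). It is only called when needed,
    so expensive external features are skipped once the score is decisive.

    Returns (matched, final_score, names_of_acquired_features).
    """
    score = initial_score
    acquired = []
    for name, acquire in features:
        if score <= lower or score >= upper:
            break  # S_n is already discriminative; skip remaining features
        score += acquire()
        acquired.append(name)
    return score > 0.0, score, acquired
```

Placing cheap local features first means a network call such as geocoding is made only for pairs that the local evidence leaves undecided.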
Classification: learning • Initial value • Estimate by the prior that two profiles sharing rarer tokens are more likely to be matched. • as the initial value
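One way to turn the rare-token prior into a number is an IDF-style weight over the name tokens two profiles share. This is an illustrative assumption, not the initial value used in this work; the function names and the summed negative-log form are hypothetical.

```python
import math
from collections import Counter


def token_frequency(profiles):
    """Fraction of profiles containing each token.

    `profiles` is a list of token lists, e.g. tokenized names.
    """
    n = len(profiles)
    df = Counter(tok for tokens in profiles for tok in set(tokens))
    return {tok: count / n for tok, count in df.items()}


def initial_score(query_tokens, target_tokens, frequency):
    """Score that grows when the shared tokens are rare in the corpus.

    Each shared token contributes -log(frequency), so a token carried by
    few profiles adds far more than a common one. Pairs with no shared
    token get 0.0. Tokens absent from `frequency` are ignored.
    """
    shared = set(query_tokens) & set(target_tokens)
    return sum(-math.log(frequency[t]) for t in shared if t in frequency)
```

With this weighting, two profiles that share "haochen" start from a higher score than two profiles that share only "john", which matches the intuition stated on the slide.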
Dataset of experiment • Data source • 152,294 Twitter users • 154,379 LinkedIn users • Ground truth: 9,750 identities • 4,779 identities with both accounts. • 3,339 identities with only Twitter account. • 1,632 identities with only LinkedIn account.
Experiment: Performance on overall linkage • I-Acc (Identity Accuracy) • correctly identified identities / all identities in ground truth • Outperforms the naïve learning method because it adopts the prior. • Performance varies across learning methods.
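The I-Acc ratio can be sketched as below. What counts as a "correctly identified" identity is an assumption here: an identity with both accounts is correct when its two profiles are linked to each other and to nothing else, and an identity with a single account is correct when that profile is left unlinked. The data layout is hypothetical.

```python
def identity_accuracy(ground_truth, predicted_pairs):
    """Identity accuracy: correctly identified identities / all identities.

    `ground_truth` is a list of (twitter_id, linkedin_id) tuples where a
    missing account is None. `predicted_pairs` is an iterable of
    (twitter_id, linkedin_id) pairs classified as matched.
    """
    twitter_links, linkedin_links = {}, {}
    for tw, li in predicted_pairs:
        twitter_links.setdefault(tw, set()).add(li)
        linkedin_links.setdefault(li, set()).add(tw)

    correct = 0
    for tw, li in ground_truth:
        if tw is not None and li is not None:
            # Both accounts exist: they must be linked only to each other.
            ok = twitter_links.get(tw) == {li} and linkedin_links.get(li) == {tw}
        elif tw is not None:
            ok = tw not in twitter_links  # Twitter-only: must stay unlinked
        else:
            ok = li not in linkedin_links  # LinkedIn-only: must stay unlinked
        correct += ok
    return correct / len(ground_truth)
```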
Experiment: Cost-sensitive feature acquisition • 5% improvement in F1 from 148,743 external feature acquisitions. • Different orders of external features. • Rank by cost • Rank by distinguishability • Three sections divided by two inflection points.
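The two orderings compared above amount to sorting the external features by different keys before running the inference loop. A minimal sketch follows; the `ExternalFeature` type is hypothetical and its `cost` and `distinguishability` fields stand for quantities that would have to be measured, not values reported in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalFeature:
    name: str
    cost: float                # relative acquisition cost (hypothetical unit)
    distinguishability: float  # higher means it separates pairs better


def rank_by_cost(features):
    """Cheapest feature first."""
    return sorted(features, key=lambda f: f.cost)


def rank_by_distinguishability(features):
    """Most discriminative feature first."""
    return sorted(features, key=lambda f: f.distinguishability, reverse=True)
```

Ranking by cost delays expensive acquisitions as long as possible, while ranking by distinguishability tends to end the loop in fewer steps; which one acquires fewer features overall depends on the data.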
Discussion: dataset construction • Dataset construction • Connections • Cannot correctly reflect the web-scale setting. • Name is too significant. • People search • Difficult to construct the ground truth. • Solution?
Discussion: people search task • Query LinkedIn with the Twitter user’s name • On average 10 results per query
Discussion: feature dependency • Compare features independently. • 2 people at Tsinghua with the same name Li Peng • 2 people at NUS with the same name Li Peng • Construct a different IDF table for names in each locale. • Not generally applicable • Not significantly effective
Conclusion • We proposed a supervised probabilistic model that solves the identity linkage problem effectively. • The prior that users sharing rarer tokens are more likely to match improves the performance of the approach. • Iterative inference reduces unnecessary feature acquisitions.