Authorship Analysis in Cybercrime Investigation

Authorship Analysis in Cybercrime Investigation Rong Zheng, Yi Qin, Zan Huang, Hsinchun Chen Artificial Intelligence Lab University of Arizona

Outline • Introduction • Literature Review • Research Questions • Experimental Design • Results & Discussions • Conclusions & Future Directions • Questions & Comments

Introduction • The Internet offers a far more convenient way to share information across time and place. • Cyberspace has also opened a new venue for criminal activities. • Cyber attacks • Distribution of illegal materials in cyberspace • Computer-mediated illegal communications within large criminal or terrorist groups • Cybercrime has become one of the major security issues for the law enforcement community.

Cybercrime • Definition: • Illegal computer-mediated activities that can be conducted through global electronic networks [Thomas, 2000] • Problems in cybercrime investigation • Data collection • Huge volume of online documents • Rule Forming • Difficult to discern illegal documents • Identity Tracing • Difficult to trace identities due to the anonymity of cybercrime • The anonymity of cyberspace makes identity tracing a significant problem that hinders investigations.

Possible Solution -- Authorship Analysis • An author might leave his unique “wordprint” in his writings. • Authorship analysis may identify the “wordprint” of the criminals. • For forensic purposes, this method has been used in a number of courts in England (the Court of Criminal Appeal), Ireland (the Central Criminal Court), Northern Ireland, and Australia.

Authorship Analysis in Cybercrime Investigation • A cyber criminal may have “wordprint” hidden in his online messages. • For example: “Hi, I have several pretty cheap CD to sell. They are all brand new , and only $1 for each. Please contact pepter@yahoo.com if you are interested.” • Cues annotated on the example: has a greeting; special character; uses email as contact method • In this study, we propose to use the authorship analysis approach to solve the problem of identity tracing in cybercrime investigation.
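
The three cues annotated on the example message can be checked mechanically. The following is a minimal sketch, not the extractor used in the study: the function name `message_cues`, the greeting list, the email pattern, and the choice of "$" as the special character are all illustrative assumptions.

```python
import re

# Hypothetical greeting list; the slides do not enumerate the greetings used.
GREETINGS = ("hi", "hello", "dear", "hey")

# Deliberately loose email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def message_cues(text):
    """Return the three cues annotated on the example message."""
    words = re.findall(r"[A-Za-z']+", text)
    first_word = words[0].lower() if words else ""
    return {
        "has_greeting": first_word in GREETINGS,
        # Assumption: the annotated special character is the dollar sign.
        "has_special_character": "$" in text,
        "email_contact": bool(EMAIL_RE.search(text)),
    }


example = (
    "Hi, I have several pretty cheap CD to sell. They are all brand new , "
    "and only $1 for each. Please contact pepter@yahoo.com if you are interested."
)
print(message_cues(example))
```

On the example message all three cues come back true; a message with no greeting, no "$", and no address would return false for each.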

Authorship Analysis • Categories: • Author identification • Author characterization • Similarity detection • Applications: • Literature of disputed authorship • Shakespeare’s work, Federalist Papers • Software forensics • Virus authorship, source code plagiarism

Performance of Authorship Analysis • Two critical research issues influence the performance of authorship analysis: • Feature selection • Find out the effective discriminators • Analytical techniques • Approach to discriminating texts by authors based on the selected features

Feature Selection • Content specific features [Elliot, 1991] • key words, special characters • Style markers • Word/Character based features [Yule, 1938] • length of words, vocabulary richness • Syntactic features [Mosteller, 1964; Baayen, 1996] • function words (‘the’, ‘if’, ‘to’), punctuation • Structural features [Vel, 2000] • has a title/signature, has separators between paragraphs
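
To make the style marker categories above concrete, here is a minimal sketch that computes a few word-based and syntactic features for one message. It is an illustration under stated assumptions, not the feature set used in the study: the function name `style_markers`, the three-word function word list, the punctuation set, and the use of the type-token ratio as the vocabulary richness measure are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative subset; the study used 150 function words.
FUNCTION_WORDS = ("the", "if", "to")
PUNCTUATION = ",.;:!?"


def style_markers(text):
    """Compute a few word-based and syntactic style markers for one message."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    n_words = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    features = {
        # Word-based features
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Assumption: type-token ratio stands in for vocabulary richness.
        "vocabulary_richness": len(counts) / n_words,
    }
    # Syntactic features: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    # Syntactic features: punctuation marks per character
    n_chars = len(text) or 1
    for mark in PUNCTUATION:
        features["punct_" + mark] = text.count(mark) / n_chars
    return features


sample = "If the price is right, send the money to me."
for name, value in style_markers(sample).items():
    print(name, round(value, 3))
```

Each message becomes a fixed-length numeric vector, which is the form a classifier expects; normalizing by message length keeps short and long messages comparable.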

Summary on Feature Selection • Content specific features are only effective in specific applications. • Word based features alone cannot represent writing style. But the combination of word based and syntactic features is very effective. [Baayen, 1996] • Structural features are helpful in Vel’s email applications. [Vel, 2000] • Style markers are the most frequently used features in past studies.

Analytical Techniques for Authorship Analysis • Statistical approaches • Univariate methods for authorship analysis • Thisted and Efron test [Thisted, 1987] • CUSUM [Farringdon, 1996] • Multivariate methods for authorship analysis • Cluster analysis [Holmes, 1995] • Principal component analysis (PCA) [Burrow, 1987] • Linear discriminant analysis (LDA) [Baayen, 2002] • Machine learning approaches • Bayesian [Mosteller, 1984] • Decision tree [Apte, 1998] • Neural Network [Merriam, 1995; Bradley, 1996] • SVM [Diederich, 2000; Vel, 2001]

Summary on Analytical Techniques • Machine learning methods generally achieved higher accuracies than statistical methods in this field. • Machine learning methods can handle a large set of features and rely less on stringent mathematical models or assumptions than statistical methods do. • The performance of authorship analysis largely depends on the quality of the feature set.

Challenges for Applying Authorship Analysis to Online Documents • Online documents are generally short in length. • The writing styles of online documents are less formal and the vocabulary is less stable. • The structure or composition style of online documents is often different from normal text documents. • Due to the internationalization of cybercrime, multilingual problems become a new challenge for authorship analysis.

Research Questions • Will authorship analysis techniques be applicable in identifying authors in cyberspace? • What are the effects of using different types of features in identifying authors in cyberspace? • Which classification techniques are appropriate for authorship analysis in cyberspace? • Will the authorship analysis framework be applicable in a multilingual context?

Experimental Design --Testbed • English Email Messages • 70 emails provided by 3 students • English Internet Newsgroup Messages • 153 potentially illegal messages written by 9 authors from misc.forsale.computers.pc-specific.software, misc.forsale.computers and mac-specific.software. • Chinese BBS Messages • 70 messages written by 3 authors from bbs.mit.edu

Experimental Design -- Techniques • Decision tree • Implemented the C4.5 algorithm to handle the continuous-valued attributes in our datasets • Backpropagation neural network • Standard three-layer fully connected backpropagation neural network • Support vector machine • BSVM [Hsu, 2002] • Use linear kernel function • Set noise term to 1000

Experimental Design -- Feature Selection • For our English dataset, the feature selection was based on Vel’s study on email authorship analysis [Vel, 2000] (We added 36 style markers and 8 content specific features): • 206 style markers • 150 function words and 56 other language-based style features • 8 structural features • 9 content specific features • illegal content specific features • For our Chinese dataset, we preliminarily extracted 60 style markers and 7 structural features.

Procedures • Three steps: • Style markers were used in the first run. • Structural features were added in the second run. • Content specific features were added in the third run (newsgroup dataset only). • This procedure was repeated for each of the three algorithms.
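
The procedure above amounts to a small grid of experiments: each algorithm is run on cumulatively larger feature sets. The sketch below only enumerates that grid and trains nothing. The names `experiment_plan` and `include_content` are hypothetical, and the group sizes are the English dataset counts from the feature selection slide.

```python
ALGORITHMS = ("C4.5 decision tree", "backpropagation neural network", "SVM")

# Group sizes for the English datasets, taken from the feature selection slide.
FEATURE_GROUPS = (
    ("style markers", 206),
    ("structural features", 8),
    ("content specific features", 9),
)


def experiment_plan(include_content):
    """List (algorithm, run, total feature count) for every experiment.

    include_content is True only for the newsgroup dataset, the one
    dataset that received the third run.
    """
    groups = FEATURE_GROUPS if include_content else FEATURE_GROUPS[:2]
    plan = []
    for algorithm in ALGORITHMS:
        total = 0
        for run, (_, size) in enumerate(groups, start=1):
            total += size  # each run adds one feature group to the previous set
            plan.append((algorithm, run, total))
    return plan


for row in experiment_plan(include_content=True):
    print(row)
```

Because each run only adds features to the previous set, any change in accuracy between consecutive runs can be attributed to the newly added feature group.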

Measures

Experimental Results

Discussions -- Techniques • SVM and neural networks achieved better performance than the C4.5 decision tree algorithm. • This confirmed the results in previous studies. [Diederich, 2000] • SVM generally had the best performance because of its capability of dealing with a large set of input features.

Discussions -- Feature Selections • Using style markers alone, we achieved high accuracy. • Style markers and the techniques are effective. • Using style markers and structural features outperformed using style markers only (with p-values < 0.05). • Consistent personal patterns exist in the message structures. • Using style markers, structural features, and content specific features did not outperform using style markers and structural features (with p-value of 0.3086). • The content distinction of those messages is not significant. • Style markers and structural features are highly effective.

Discussions -- Datasets • The measures of prediction performance drop significantly for the Chinese dataset compared with the English datasets. • We only used 67 features for the Chinese dataset. • A larger set of function words is needed. • Nevertheless, we achieved 70% - 80% accuracy.

Conclusions • The experimental results indicated a promising future for applying the authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. • Structural features are significant discriminators for online documents. • SVM and neural network techniques achieved high performance. • This approach is promising in the multilingual context.

Future Directions • More illegal messages will be incorporated into our testbed. • The current approach will be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child pornography. • A more challenging future direction is to automatically generate an optimal feature set suited to a given dataset.

Questions & Comments Thank you!
