AMCS/CS 340: Data Mining

Project Topics
AMCS/CS 340: Data Mining
Xiangliang Zhang
King Abdullah University of Science and Technology

Project
• Software implementation related to course subject matter.
• Should involve an original component or experiment.
• Project: 40%
• Your project will be scored on: technical quality (30) + significance (30) + novelty/impact (20) + report/presentation (20)
• Start your project early!

Team Projects
Working individually? Or in pairs? Both are OK, but …
• No more than 2 people per project.
• We will expect more from a pair than from an individual.
• The effort should be roughly evenly distributed.

Projects: introduction
• Your project can be work that involves:
  • discovering interesting relationships or patterns within a significant amount of data
  • developing an original idea that extends or builds on what we learned in class
  • extending, improving, or speeding up an existing algorithm
  • defining a new problem and solving it

Projects: Deliverables 1
• Project proposal: a one-page description of what you plan to do for your project, due Oct. 17th.
  • Submit it through the Blackboard system.
  • For two-person teams:
    • Each of you should write a proposal.
    • Clearly define your part in the project.
• Midterm report: a 1- or 2-page report of what you have done in one month, due Nov. 12th.
  • Submit it through the Blackboard system.
  • For two-person teams:
    • Each of you should write a report.
    • Report your part in the project.

Projects: Deliverables 2
• Final project report: a comprehensive description of your project (5-10 pages), due Dec. 3rd.
  • Submit it through the Blackboard system.
  • For two-person teams:
    • One report co-authored by both of you.
    • Clarify each person's contribution to the project.
• Final presentation: in the last week of class (Dec. 4th and 7th), each team presents its project to the rest of the class. The slides of your project will be posted online, so that more people can see your work.
  • Each team: 10 minutes
  • For two-person teams: 5 minutes per person

Projects with another class?
• Are you doing a project for another class and want to combine the two into one big project?
  • Yes, you can do that, but show your contribution using data mining techniques.
  • We expect more contribution than from a single project.
• Do you want a project related to your own research topic (e.g., as a Ph.D. student)?
  • Yes, you can do that.
  • Show your contribution using data mining techniques.

Project work suggestions
• Start early
  • You only have 7 weeks.
• Having difficulties?
  • Talk to your partner, your classmates, your supervisor, me, …
• Prepare for a question you may get in a job interview: which DM techniques have you used to solve which problem?

Project ideas
• Text mining
  • Enron E-mail Dataset
    • The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available at: http://www.cs.cmu.edu/~enron/
  • Project ideas:
    • Can you classify the text of an e-mail message to decide who sent it?
    • Automatic categorization of e-mail into folders
    • Recipient recommendation systems, i.e., suggesting who the recipients of a message might be while the message is being composed, given its current contents and its previously specified recipients.
    • Search in Google for references
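As a starting point for the "who sent it?" idea, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The technique is one possible choice, not a prescribed method. The four messages in `TOY_MESSAGES` and the sender names are made up for illustration and are not taken from the Enron data; a real project would parse the actual mailboxes and use a proper tokenizer.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy messages (sender, text); NOT real Enron data.
TOY_MESSAGES = [
    ("alice", "please review the gas trading report"),
    ("alice", "gas prices and trading volumes attached"),
    ("bob", "lunch meeting moved to friday"),
    ("bob", "see you at the friday meeting"),
]


def tokenize(text):
    """Lowercase a message and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(messages):
    """Count words per sender from a list of (sender, text) pairs."""
    word_counts = defaultdict(Counter)  # sender -> word -> count
    sender_counts = Counter()           # sender -> number of messages
    vocabulary = set()
    for sender, text in messages:
        tokens = tokenize(text)
        sender_counts[sender] += 1
        word_counts[sender].update(tokens)
        vocabulary.update(tokens)
    return word_counts, sender_counts, vocabulary


def predict_sender(model, text):
    """Return the sender with the highest log-posterior for this text."""
    word_counts, sender_counts, vocabulary = model
    total_messages = sum(sender_counts.values())
    best_sender, best_score = None, float("-inf")
    for sender in sender_counts:
        # Log prior: fraction of training messages written by this sender
        score = math.log(sender_counts[sender] / total_messages)
        total_words = sum(word_counts[sender].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed word likelihood
            score += math.log(
                (word_counts[sender][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_sender, best_score = sender, score
    return best_sender


model = train_naive_bayes(TOY_MESSAGES)
print(predict_sender(model, "trading report on gas"))
```

To evaluate such a classifier on real data, hold out part of each user's sent mail and report accuracy on the held-out messages only.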

Project ideas
• Text mining
  • 20 Newsgroups data set
    • The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups: http://people.csail.mit.edu/jrennie/20Newsgroups/
  • Project ideas:
    • Extract keywords, predict article category
    • Named entity recognition
    • Entity relation extraction
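For the keyword-extraction idea, one common baseline is TF-IDF scoring: a term ranks high in a document when it is frequent there but rare in the rest of the collection. The sketch below assumes whitespace tokenization and uses three made-up sentences (`TOY_DOCS`) in place of real newsgroup posts; the function name `tfidf_keywords` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents; NOT taken from the 20 Newsgroups data.
TOY_DOCS = [
    "the hockey team won the hockey game",
    "the space shuttle launch was delayed",
    "the team discussed the launch",
]


def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    keywords = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by score (descending), then alphabetically to break ties
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


for doc, terms in zip(TOY_DOCS, tfidf_keywords(TOY_DOCS)):
    print(terms, "<-", doc)
```

A word that appears in every document (here "the") gets an IDF of log(1) = 0, so it never outranks a more specific term.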

Project ideas
• Social network/Graph mining
  • Stanford large network dataset collection: http://snap.stanford.edu/data/index.html
    • Social networks: online social networks, edges represent interactions between people
    • Communication networks: email communication networks with edges representing communication
    • Citation networks: nodes represent papers, edges represent citations
    • Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
    • Web graphs: nodes represent webpages and edges are hyperlinks
    • Blog and Memetracker graphs: nodes represent time-stamped blog posts, edges are hyperlinks
    • Amazon networks: nodes represent products and edges link commonly co-purchased products
    • Internet networks: nodes represent computers and edges communication
    • Road networks: nodes represent intersections and edges roads connecting the intersections
    • Autonomous systems: graphs of the internet
    • Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)

Project ideas
• Social network/Graph mining
  • Project ideas:
    • Find communities, clusters in such a big graph
    • Count frequent subgraphs
    • Design algorithms to characterize the structure of the network as a whole
    • Predict links
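For the link-prediction idea, a simple neighborhood-based baseline is to score every pair of unlinked nodes by the Jaccard coefficient of their neighbor sets. The sketch below is one such baseline, not a recommended final method. It assumes an undirected graph given as an edge list; `TOY_EDGES` and the function names are made up for illustration, and scoring all pairs like this is only feasible on small graphs.

```python
from itertools import combinations

# Hypothetical toy undirected graph; NOT one of the SNAP data sets.
TOY_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]


def build_adjacency(edges):
    """Map each node to the set of its neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_link_scores(edges):
    """Rank unlinked node pairs by the Jaccard coefficient of their neighbor sets."""
    adjacency = build_adjacency(edges)
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the pair is already linked
        common = adjacency[u] & adjacency[v]
        union = adjacency[u] | adjacency[v]
        scores[(u, v)] = len(common) / len(union)
    # Highest score first; ties broken by node names
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for pair, score in jaccard_link_scores(TOY_EDGES):
    print(pair, round(score, 3))
```

A usual way to evaluate this is to hide a fraction of the known edges, score the remaining graph, and check how many hidden edges appear near the top of the ranking.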

Project ideas
• Recommendation
  • Movie rating:
    • Netflix prize dataset: http://www.netflixprize.com/
      • 17,770 movies
      • 480,189 users
      • 100,480,507 ratings
    • On 21 September 2009, the grand prize of US$1,000,000 was awarded to the BellKor's Pragmatic Chaos team, which bested Netflix's own algorithm for predicting ratings by 10%. [paper]
  • Project ideas:
    • Collaborative filtering algorithm to predict user ratings for movies, based on previous ratings.
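To make the collaborative-filtering idea concrete, here is a minimal sketch of user-based collaborative filtering: a user's unknown rating is predicted from the mean-centered ratings of positively correlated users, weighted by Pearson correlation. This is one textbook variant, not the prize-winning method. The `TOY_RATINGS` dictionary and the user and movie IDs are made up, and the Netflix data would need a sparse, far more efficient representation.

```python
import math

# Hypothetical toy ratings (user -> movie -> rating); NOT Netflix data.
TOY_RATINGS = {
    "u1": {"m1": 5, "m2": 4, "m3": 1},
    "u2": {"m1": 5, "m2": 5, "m3": 2, "m4": 5},
    "u3": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users rated."""
    shared = set(ratings_a) & set(ratings_b)
    if len(shared) < 2:
        return 0.0
    mean_a = sum(ratings_a[m] for m in shared) / len(shared)
    mean_b = sum(ratings_b[m] for m in shared) / len(shared)
    cov = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in shared)
    var_a = sum((ratings_a[m] - mean_a) ** 2 for m in shared)
    var_b = sum((ratings_b[m] - mean_b) ** 2 for m in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict_rating(ratings, user, movie):
    """Predict a rating from positively correlated users who rated the movie."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = pearson(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # skip uncorrelated or negatively correlated users
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        # Weight the neighbor's deviation from their own mean by similarity
        numerator += similarity * (other_ratings[movie] - other_mean)
        denominator += similarity
    if denominator == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean
    return user_mean + numerator / denominator


print(predict_rating(TOY_RATINGS, "u1", "m4"))
```

The prediction is not clipped to the rating scale here; a real system would clamp it (for example to 1-5) and measure error such as RMSE on held-out ratings.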

Project ideas
• Comparative studies of DM algorithms
  • Comparison of SVM implementations (done in 2010)
    • Provide a systematic comparison of several implementations of SVM.
  • Comparison of algorithms for semi-supervised clustering
    • Groups similar objects together (done in 2010)
    • Incorporates additional information (e.g., the known labels of some of the objects) into the computation of object distance
  • Study and compare implementations of parallel formulations of clustering techniques.
  • A comparative study of techniques for clustering association rules.
  • A comparative study and implementation of classification using association patterns (rules and itemsets)

Projects of DM course at Stanford
• Frequency-Domain Characterization of Trending Topics
• Identifying Trending Topics on Twitter
• Wikipedia vandalism
• Product Offer Comparison across Different Merchants
• Extracting Information from Yelp Reviews
• Exploring Methods of De-Novo Short Read Assembly Using MapReduce
• Topic Chaining and Phrase Linking
• Understanding Correlations between Product Reviews and Ratings
• Finding the Social Roots of Controversy in Wikipedia
• Techniques to improve detection of trending topics on Twitter
• Mining Hospital Records for Predicting Patient Drop-off
• Social Information Engine: Data Mining Twitter for Product Recommendations
• Comparing the impact of cross-disciplinary and cross-institutional academic research:
• Woodstock: Using Twitter tweets' sentiments to predict stock price change
• Book Recommendation System (done in 2010 at KAUST: BOOKRANK)
• Seven years of Wikipedia's Revision History as a Time dependent Graph: A Love Story
• Adaptive Locality Sensitive Hashing for Recommending Twitter Followers
• Combining Content Filtering and Collaborative Filtering for the Netflix Prize
• Hashtags on Twitter
• Collaborative Filtering on Netflix Challenge
• A Music Recommendation System
• Content Based Auto-tagging of Flickr Images using ImageWebs
• A Data Mining Based Approach to Determining Causal Associations Between Drugs and Condition
• Twitter Personal Newspaper
• WikiSuggest: A Suggestion Engine for Editors on Wikipedia

Projects of DM course at U. Illinois
• Replying Relationship Reconstruction on Forum Conversation: a Link Mining Approach
• Entity page retrieval: Use web structures to find webpages which represent entities
• Simulation Cube: A Framework for Sampling, predicting and Analysis of data.
• Applying data mining for hardware design validation
• Strategy Mining using Game Replays: Develop a replay tagging system for extraction, classification and clustering on game strategies.
• Graph Regularized Ranking-based Clustering of Heterogeneous Information Networks
• Effective Classification Method for Searching Promising Compounds
• Discovering Entity Relationship Across Multiple Web Data Sources
• GoldMine: Automatic Assertion Generation using Association Rule Mining and Set Covering
• Mining Representative Frequent Patterns Using the MDL Tree-cut Model
• Exploiting Structural and Visual Layout Patterns for Entity List Extraction
• DIAMOND: Real-Time Anomaly Monitoring Daemon for DIME
• Understanding Multimedia in Online Community - Image and Text Combined Mining
• Query/Page Classification Through Label Propagation Using Query Logs
• Informative Sentence Mining in Multi-Product Document by PLSA
• Topic Model with Frequent Sequential Pattern
• Evolutionary Clustering and Analysis of Bibliographic Networks
• A Recommendation System that can compare two entities based on specified features, and recommend the best amongst them.
• MedClus: Extending NetClus to the PubMed publication network

Projects of DM course at KAUST
• BioMedical Text Mining
  1. Haitham Ashoor, Text Disambiguation
  2. Abdullah M Khamis, Statistical Learning Based System for Text Classification
  3. Lailatul Hidayah and Putri Novianti, Document Classification in Biological Literature Based on Substring Selection
• Movie Recommendation system
  4. Nedhal Mourad and Yasser Ebrahim, Refining Collaborative Recommendations Using Dissimilar Elements
  5. Feng Yan, Item-based Netflix Recommendation System
  6. Ameer Khan and Francisco Franco, Collaborative Filtering based on adjusted Pearson Correlation with an exploration of the effects of clustering
  7. Abdullah Kassas and Eyad Al Sibai, Hybrid Recommendation System
  8. Tyler Barth, Improving Collaborative Filtering through Content-based Data Set Augmentation

Projects of DM course at KAUST
• Algorithms comparison
  9. Samara Alcantar, Comparison of SVM Implementations
  10. Ka Chun Wong, Interesting Clustering Algorithms Implementation and Testing
• Text Mining
  11. Ehab Abdelhamid, Document Clustering using Graph Mining Techniques
  12. Wail Ba alawi and Khaled Saeed, Document Categorization by NN, DT, NB, and K-nn
  13. Abdulelah Bin Mahfoodh, Using Readability Tests for Prediction in Text Categorizations
  14. Kwan Wai FAN, Text categorization through different data mining methods
  15. Tak Man Desmond Lee, News Recommendation System
  16. Ahmad Lutfullah, NBA: Outstanding Players, Positions Inference and Ranking Forecast

Projects of DM course at KAUST
• Social Network/Graph
  17. Wael Al-Alwani, Twitter Demographical Communities and Message Clustering
  18. Guoda Chen, Connection Subgraph Mining on Social Network Graph
  19. Kenneth Bailey, Runtime Email Recommendation System
  20. Adrian Reyes and Santiago Ganis, BookRank: A Ranking System for Books Sold at Amazon
  21. José A. Valenzuela and Mustafa Nabulsi, Meme-Tracking and News' Cycle Analysis
• Parallelization/Multi-core problem
  22. Manal Kalkatawi and Zuhair Khayyat, Parallel Random Forest Trees
  23. Tareq Malas, Parallel SVM automatic parameter search for parallel multi-core architectures
  24. Farania Rangkuti, Single Core vs Multi-core Implementation of Neural Network
  25. Karim Ahmed Awara, Scaling Affinity Propagation with Data Cyclotron

Projects of DM course at KAUST
• Time Series Analysis
  26. Chandra Prasetyo Utomo, Hybrid Intelligent System in Permeability Prediction in Oil Well Logs
  27. Ka-Ying Lam, Hard Stringers Detection in Oil Well by Time Series Analysis
  28. Faisal Ramay, Incorporating News to Improve Prediction of Stock Price Trends
  29. Shuai Zheng, Exploring the network flow dataset using data mining skills
  30. Chengbin Peng and Hong Zhu, Time Series Clustering
  31. Eric Shiu, HPC Workload Forecasting with k-NN
• Bioinformatics
  32. Othman Soufan, ECG Classification using An Integration Technique to detect different types of diseases in the heart beat electrocardiograms ECG signals
  33. Gregorio Alanis Lobato, Manifold Learning Applied to Protein-Protein Interaction Network Reliability Assessing
  34. Abdullah Abdullah and Malek Mahayni, Protein Interaction Prediction

The project proposal
• Answer the following questions:
  • What is the problem you are solving?
  • What data will you use (where will you get it)?
  • How will you do it?
    • Which algorithms/techniques do you plan to use?
    • Be as specific as you can!
  • How will you evaluate your result?
  • What do you expect to submit in the end?
