Final Review
7-Text Mining • Unstructured Data • Two modes of mining: analysis vs. retrieval • Precision vs. Recall as metric • With lots of data you can find anything • Tools for text mining • Stopwords, stemming • Term document matrix • TF-IDF • Latent Semantic Indexing (LSI) • Uses SVD (closely related to PCA) to find ‘concepts’ (topics) • Documents that share concepts will be close • Probabilistic Models • Naïve Bayes vs. Multinomial • LDA: Documents from Topics from Words
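As a worked illustration of the stopword and TF-IDF bullets, here is a minimal sketch in plain Python. The stopword list, the example documents, and the particular weighting (term count divided by document length, times log(N / document frequency)) are assumptions chosen for illustration; other TF-IDF variants exist and the course may have used a different one.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(docs)
```

In this toy corpus "mat" occurs in only one document, so it receives a higher weight in the first document than "cat", which occurs in two.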
8-Web Mining • Detecting robots • Markov Models for Page prediction • Ranking web pages • Flow model • Power iteration • Random walk and the stationary distribution • Spider traps and how to get around them • AdWords model for advertising cost-per-click
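The power iteration and spider trap bullets can be sketched as follows. This is a minimal illustration, assuming the usual random-teleport fix (with probability 1 - beta the walker jumps to a uniformly random page); the damping value 0.85, the function name, and the toy graph are assumptions, and every link target is assumed to appear as a key of the graph.

```python
def pagerank(links, beta=0.85, tol=1e-10, max_iter=1000):
    """Power iteration with teleportation.

    links: dict mapping each node to the list of nodes it links to.
    Returns a dict mapping each node to its rank (ranks sum to 1).
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        # Teleport share: every node receives (1 - beta) / n.
        new = {v: (1.0 - beta) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                # Follow a random out-link with probability beta.
                share = beta * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dead end: spread its rank evenly over all nodes.
                share = beta * rank[v] / n
                for w in nodes:
                    new[w] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank

# "trap" links only to itself, so without teleportation it would
# eventually absorb all of the rank.
web = {"a": ["b", "c"], "b": ["c", "trap"], "c": ["a"], "trap": ["trap"]}
ranks = pagerank(web)
```

With teleportation the trap still collects a large share of the rank, but the other pages keep nonzero scores and the result is a proper stationary distribution.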
9-Advanced Classification • Neural Networks • Neuron: inputs, linear combination, activation function, output • Architecture: layers, nodes per layer • Training through backpropagation • Good for complex problems like face detection, speech, video • Support Vector Machines • Assume classes are separable • Plus/minus plane, margin, support vectors • Finds the maximum-margin separating classifier • If not separable, use the “kernel trick”
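To make the neuron and architecture bullets concrete, here is a small forward-pass sketch: each neuron takes a linear combination of its inputs and applies an activation function, and layers are chained. The sigmoid activation and the hand-picked weights (which make a 2-2-1 network approximate XOR) are assumptions for illustration; no training or backpropagation is shown.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Linear combination of the inputs followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def forward(x, layers):
    """Feed x through the network.

    layers: list of layers; each layer is a list of (weights, bias)
    pairs, one pair per node in that layer.
    """
    activations = x
    for layer in layers:
        activations = [neuron(activations, w, b) for w, b in layer]
    return activations

# Hand-picked weights (not learned): the hidden layer computes roughly
# OR and NAND, and the output node computes roughly AND of the two.
xor_net = [
    [([20.0, 20.0], -10.0), ([-20.0, -20.0], 30.0)],  # hidden layer, 2 nodes
    [([20.0, 20.0], -30.0)],                          # output layer, 1 node
]

outputs = {
    (a, b): forward([a, b], xor_net)[0] for a in (0, 1) for b in (0, 1)
}
```

XOR is not linearly separable, so a single neuron cannot represent it; the hidden layer is what makes it possible.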
10-Ensembles • Ensemble Methods • Collections of ‘small’ models can fit something complex • Typically beats individual models • Model Averaging • Boosting – fit models in sequence, with previous errors upweighted • Bagging – fit models to bootstrapped versions of the data • Random Forests – fit trees using a random subset of variables at each split
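The bagging bullet can be sketched with very small models: one-split regression stumps fit to bootstrap resamples, with predictions averaged. The stump learner, the data (y = x squared), and the number of models are assumptions for illustration only.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimising squared error.

    Returns a function x -> prediction.
    """
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave points on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # All x values identical: fall back to predicting the mean.
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_stumps(xs, ys, n_models=50, seed=0):
    """Fit stumps to bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
predict = bagged_stumps(xs, ys)
```

A single stump can only output two values; the average over many resampled stumps gives a smoother step-like curve, which is the sense in which a collection of small models can fit something more complex.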
11-Bayesian Methods • Hierarchical Modelling with MCMC • No pooling vs. complete pooling vs. Bayesian solution • Priors tell how much you should depend on the data • Conjugate priors (e.g. beta/binomial) make life easy • MCMC for other cases • Metropolis-Hastings: sample from the posterior • Use trace plots to assess convergence
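The conjugate prior and Metropolis-Hastings bullets can be compared on one small problem. With a Beta(a, b) prior and k successes in n binomial trials, the posterior is Beta(a + k, b + n - k), so its mean is known in closed form; the sketch below samples the same posterior with a random-walk Metropolis-Hastings sampler. The data (7 successes in 10 trials), the uniform prior, the step size, and the burn-in length are all assumptions for illustration.

```python
import math
import random

def log_posterior(theta, k, n, a=1.0, b=1.0):
    """Unnormalised log posterior: Beta(a, b) prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (a - 1 + k) * math.log(theta) + (b - 1 + n - k) * math.log(1 - theta)

def metropolis_hastings(k, n, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, k, n)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, k, n)
        # Symmetric proposal, so accept with probability
        # min(1, posterior(proposal) / posterior(current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

k, n = 7, 10
samples = metropolis_hastings(k, n)
kept = samples[2000:]  # discard burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = (1.0 + k) / (2.0 + n)  # mean of Beta(1 + k, 1 + n - k)
```

Plotting `samples` against the iteration number gives the trace plot mentioned above; a chain that has converged wanders around a stable level rather than drifting.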
12-Recommender Systems • Netflix Prize • We won! • Recommender Systems • Evaluation via RMSE or DCG • Nearest Neighbors • SVD • Ensembles (of teams, of models) very powerful
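The two evaluation metrics named above can be written down in a few lines. This is a minimal sketch: RMSE over predicted and actual ratings, and DCG with the common log2(position + 1) discount. DCG has several variants, so the exact form used here is an assumption.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

def dcg(relevances):
    """Discounted cumulative gain.

    relevances: relevance scores in ranked order (best-ranked item first).
    The item at 1-based position i is discounted by log2(i + 1).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

error = rmse([1.0, 1.0], [2.0, 2.0])
good_order = dcg([3, 2, 1])  # most relevant item ranked first
bad_order = dcg([1, 2, 3])   # most relevant item ranked last
```

RMSE scores how close the predicted ratings are, while DCG scores the ordering: placing the most relevant items first gives a higher value.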
13-Networks • Nodes and edges • Node and edge centrality • Degrees and degree distribution • Network Models • Erdős-Rényi • Preferential Attachment • Power Law graphs • Small world networks
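The degree distribution and Erdős-Rényi bullets can be illustrated with a short sketch: generate a G(n, p) random graph, in which each possible edge is included independently with probability p, then tabulate node degrees. The function names, graph size, and edge probability are assumptions for illustration.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=0):
    """Undirected G(n, p) graph: include each possible edge with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def degrees(n, edges):
    """Degree of each node 0..n-1 in an undirected edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def degree_distribution(deg):
    """Map each observed degree to the number of nodes with that degree."""
    return dict(sorted(Counter(deg).items()))

n = 200
edges = erdos_renyi(n, p=0.05)
deg = degrees(n, edges)
dist = degree_distribution(deg)
```

In this model the degrees cluster around the mean p(n - 1), in contrast to preferential attachment, which produces a heavy-tailed, power-law-like degree distribution.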