
R = Uᵀ ∘ I


Presentation Transcript


  1. Simon Funk: Netflix provided a database of 100M ratings (1 to 5) of 17K movies by 500K users, each given as a triplet of numbers: (User, Movie, Rating). The challenge: for a (User, Movie, ?) not in the database, predict how the given User would rate the given Movie.

  Think of the data as a big, sparsely filled matrix, with userIDs across the top and movieIDs down the side (or vice versa, then transpose everything), where each cell contains an observed rating (1-5) for that movie (row) by that user (column), or is blank, meaning you don't know. This matrix would have 8.5B entries, but you are only given values for 1/85th of those 8.5B cells (100M of them); the rest are all blank. Netflix posed a "quiz": a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. Squared error (SE) measures accuracy: if your guess is 1.5 and the actual rating is 2, you get docked (2 - 1.5)^2 = 0.25. (They use root mean squared error (RMSE), but RMSE and MSE are monotonically related.) There is a date attached to each rating and question mark, so a cell can potentially hold more than one rating.

  Any movie can be described in terms of some aspects or attributes such as overall quality, action (y/n?), comedy (y/n?), stars, producer, etc. Every user's preferences can be roughly described in terms of whether they tend to rate quality/action/comedy/stars/producer/etc. high or low. If that is true, then ratings ought to be explainable by a lot less than 8.5 billion numbers (e.g., a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie). SVD assumes rating(u,m) is a sum of preferences about the various aspects. E.g., take 40 aspects: a movie m is described by 40 values, m(f), saying how much that movie exemplifies each aspect, and a user u is described by 40 values, u(f), saying how much they prefer each aspect. Then rating(u,m) = u(f) dot m(f), and 40*(17K + 500K) =~ 20M values << 8.5B. Or:

      ratingsMatrix[user][movie] = sum of userFeature[f][user] * movieFeature[f][movie], for f from 1 to 40 (or 1 to F in general).

  In matrix form, R = Uᵀ ∘ I, with U the F × |users| user-feature matrix and I the F × |movies| movie-feature matrix:

      r_{u,i} = u ∘ i = Σ_{f=1..F} r_{u,f} * r_{f,i}

  and the training update for a user vector is

      u += lrate * (ε_{u,i} * iᵀ − λ * u), where ε_{u,i} = r_{u,i} − r̂_{u,i}, r_{u,i} being the actual rating and r̂_{u,i} the prediction.

  The original matrix has been decomposed into 2 oblong matrices: a 17K×40 movie-aspect matrix and a 500K×40 user-preference matrix. SVD is a trick for finding the 2 smaller matrices which minimize the resulting approximation error--specifically, the mean squared error. So, if we take the rank=40 SVD of the 8.5B matrix, we have the best (least-error) approximation we can get within the limits of our user-movie-rating model. I.e., the SVD has found the "best" generalizations. Take the derivative of the approximation error and follow it; this has the bonus that we can ignore the unknown error on the 8.4B empty slots. Taking the derivative of the error equations--over just the given values, not the empties--with respect to the parameters gives:

      userValue[user] += lrate * err * movieValue[movie];
      movieValue[movie] += lrate * err * userValue[user];

  With horizontal data, this code is evaluated for each rating. So, to train one feature for one sample:

      real *userValue = userFeature[featureBeingTrained];
      real *movieValue = movieFeature[featureBeingTrained];
      real lrate = 0.001;

  More correctly, cache the pre-update user value so the movie update uses it:

      uv = userValue[user];
      userValue[user] += lrate * err * movieValue[movie];
      movieValue[movie] += lrate * err * uv;

  Training one feature at a time finds the most prominent feature remaining (the one that most reduces error).
  When a feature is as good as it's going to get, shift it onto the pile of done features and start a new one (caching the residuals of the 100M ratings so the done features need not be recomputed). ("What does that mean for us???") This gradient descent has no local minima, which means it doesn't really matter how it's initialized.
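  To make the training loop concrete, here is a minimal, self-contained C++ sketch of the per-rating update described above (the toy sizes and names are mine; lrate = 0.001 follows the text):

      #include <cstdio>

      const int F = 40;             // number of features (aspects)
      const int NUM_USERS = 4;      // toy sizes, for illustration only
      const int NUM_MOVIES = 3;
      const float LRATE = 0.001f;

      float userFeature[F][NUM_USERS];
      float movieFeature[F][NUM_MOVIES];

      // Predicted rating: dot product of the user and movie feature vectors.
      float predict(int user, int movie) {
          float sum = 0;
          for (int f = 0; f < F; f++)
              sum += userFeature[f][user] * movieFeature[f][movie];
          return sum;
      }

      // One gradient step on one observed rating for the feature being trained.
      // Note the cached uv, so the movie update uses the pre-update user value.
      void trainSample(int f, int user, int movie, float rating) {
          float err = rating - predict(user, movie);
          float uv = userFeature[f][user];
          userFeature[f][user]   += LRATE * err * movieFeature[f][movie];
          movieFeature[f][movie] += LRATE * err * uv;
      }

      int main() {
          for (int f = 0; f < F; f++) {
              for (int u = 0; u < NUM_USERS; u++)  userFeature[f][u]  = 0.1f;
              for (int m = 0; m < NUM_MOVIES; m++) movieFeature[f][m] = 0.1f;
          }
          trainSample(0, /*user=*/2, /*movie=*/1, /*rating=*/4.0f);
          std::printf("new prediction: %f\n", predict(2, 1));
          return 0;
      }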

  2. Refinements: Prior to starting SVD, compute AvgRating(movie) for every movie and AvgOffset(UserRating, MovieAvgRating) for every user. I.e.:

      static inline real predictRating_Baseline(int movie, int user) { return averageRating[movie] + averageOffset[user]; }

  So that's the return value of predictRating before the first SVD feature even starts training.

  You'd think the average rating for a movie would just be... its average rating! Alas, Occam's razor was a little rusty that day. If m appears only once, with r(m,u) = 1 say, is AvgRating(m) = 1? Probably not! View r(m,u) = 1 as a draw from a true probability distribution whose average you want, and view that true average itself as a draw from a probability distribution of averages--the histogram of average movie ratings. Assume both distributions are Gaussian; then the best-guess mean is a linear combination of the observed mean and the apriori mean, with a blending ratio equal to the ratio of their variances. If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings), and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean--e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the average variance is high, ratings tend to be more random and less indicative), then:

      BogusMean = sum(ObservedRatings) / count(ObservedRatings)
      K = Vb / Va
      BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

  The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the apriori average than the sparsely observed average. Note that if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating, as one would expect. (A runnable sketch of this blend follows below.)
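  A small runnable sketch of the BetterMean blend (the function and the illustrative numbers in main are mine; Va and Vb are assumed to have been estimated beforehand):

      #include <cstdio>
      #include <vector>

      // Blend a movie's observed ratings with the global average, weighted by
      // K = Vb/Va (Vb: average variance of individual ratings; Va: variance of
      // the movies' averages). With zero observations this returns the global average.
      double betterMean(const std::vector<double>& observed,
                        double globalAverage, double Va, double Vb) {
          double K = Vb / Va;
          double sum = 0;
          for (double r : observed) sum += r;
          return (globalAverage * K + sum) / (K + observed.size());
      }

      int main() {
          std::vector<double> oneRating = {1.0};  // a movie rated exactly once
          // Illustrative values only: global average 3.6, Va = 0.25, Vb = 1.1.
          std::printf("better mean: %f\n", betterMean(oneRating, 3.6, 0.25, 1.1));
          return 0;
      }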
  Moving on: 20M free parameters is a lot for a 100M-rating training set. It seems neat to just ignore all the blanks, but we do have expectations about them, and as-is this modified SVD algorithm tends to make a mess of sparsely observed movies or users. Say you have a user who has rated only 1 movie, American Beauty = 2 while its average is 4.5, and further their offset is only -1: prior to SVD we'd expect them to rate it 3.5, so the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). Suppose m(Action) is training up to measure the amount of Action, say .01 for American Beauty (just slightly more than average). SVD optimizes predictions, which it can do here by eventually setting our user's preference for Action to a huge -150. I.e., the algorithm naively looks at the only example it has of this user's preferences and, in the context of only the one feature it knows about so far (Action), determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for, because those random apparent correlations average out and the true trends dominate. We need to account for priors: as with the average movie ratings, blend our sparse observations in with some sort of prior--but it's a little less clear how to do that with this incremental algorithm.

  If you look at where the incremental algorithm theoretically converges, you get:

      userValue[user] = [sum residual[user,movie] * movieValue[movie]] / [sum (movieValue[movie]^2)]

  The numerator there falls in a roughly zero-mean Gaussian distribution when charted over all users, which, through various gyrations, leads to:

      userValue[user] = [sum residual[user,movie] * movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

  and finally back to the incremental form:

      userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]);
      movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

  This is equivalent to penalizing the magnitude of the features, which cuts overfitting and allows the use of more features.
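  Equivalently (this summary is mine, not from the original post), the decayed updates are per-rating gradient steps on a ridge-regularized squared error:

      E = \sum_{(u,i)\,\mathrm{observed}} \bigl( r_{u,i} - \mathbf{u}\cdot\mathbf{i} \bigr)^2
          + K \Bigl( \sum_u \lVert\mathbf{u}\rVert^2 + \sum_i \lVert\mathbf{i}\rVert^2 \Bigr),
      \qquad
      -\tfrac{1}{2}\,\frac{\partial E}{\partial \mathbf{u}}
          = \sum_i \varepsilon_{u,i}\,\mathbf{i} - K\,\mathbf{u},
      \quad \varepsilon_{u,i} = r_{u,i} - \mathbf{u}\cdot\mathbf{i},

  so stepping a distance lrate along this direction, one rating at a time, is exactly the pair of updates above.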

  3. Moving on: linear models are limiting. We've bastardized the whole matrix analogy so much that we aren't really restricted to linear models, so we can add non-linear outputs: instead of predicting with

      sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40,

  we can use

      sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40.

  Two choices for G proved useful. 1. Clip the prediction to 1-5 after each component is added, i.e., each feature is limited to only swaying the rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4, because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our performance, and having trained a stage with clipping on, you must use it with clipping on. I did not really play with this extensively enough to determine there wasn't a better strategy.

  A second choice for G is to introduce some functional non-linearity such as a sigmoid, i.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward. The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general, and the one that worked the best, was to simply fit a piecewise-linear approximation to the true output vs. target-output curve. That is, if you plot the true output of a given stage against the average target output, a linear model assumes this is a nice 45-degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones. That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality; or, put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity, and so on. This introduces new free parameters and encourages overfitting, especially for the later features, which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that.
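  For concreteness, a sketch of the first choice of G (clipping to [1,5] after each feature's contribution); this mirrors the per-feature clipping in the PredictRating code on slide 6, though the function name is mine:

      // Predict by accumulating one feature at a time, clipping to [1,5] after
      // each stage, so any excess from one feature is lost rather than carried
      // over into the next.
      float predictClipped(const float* userF, const float* movieF, int F) {
          float sum = 1.0f;                 // or a baseline / pseudo-average
          for (int f = 0; f < F; f++) {
              sum += userF[f] * movieF[f];
              if (sum > 5) sum = 5;         // G: clip after each component
              if (sum < 1) sum = 1;
          }
          return sum;
      }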
  Moving on: despite the regularization term in the final incremental rule above, overfitting remains a problem. Plotting the progress over time, the probe RMSE eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs, appropriate to the learning rate and regularization constant, resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started overfitting. Note that now it does matter how you initialize the vectors: since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point. I wonder whether a better regularization couldn't eliminate overfitting altogether--something like Dirichlet priors in an EM approach--but I tried that and a few others, and none worked as well as the above.

  [Figure: probe and training RMSE for the first few features, with and without the regularization term "decay" enabled.] [Figure: the same, just the probe-set RMSE, further along, where you can see the regularized version pulling ahead.] [Figure: probe RMSE (vertical) against train RMSE (horizontal); note how the regularized version has better probe performance relative to its training performance.]

  Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and while many of them have worked well up front, none held their advantage long enough to actually improve the final result. If you notice any obvious errors, or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me know. And of course I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share.

  4. //=======================================================
  // SVD Sample Code (C) 2007 Timely Development (www.timelydevelopment.com)
  //
  // STANDARD DISCLAIMER:
  // - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
  // - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
  // - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
  // - FITNESS FOR A PARTICULAR PURPOSE.
  //=======================================================
  #define WIN32_LEAN_AND_MEAN
  #include <windows.h>
  #include <stdio.h>
  #include <math.h>
  #include <tchar.h>
  #include <map>
  using namespace std;

  //================================================
  // Constants and Type Declarations
  //================================================
  #define TRAINING_PATH   L"C:\\netflix\\training_set\\*.txt"
  #define TRAINING_FILE   L"C:\\netflix\\training_set\\%s"
  #define FEATURE_FILE    L"C:\\netflix\\features.txt"
  #define TEST_PATH       L"C:\\netflix\\%s"
  #define PREDICTION_FILE L"C:\\netflix\\prediction.txt"

  #define MAX_RATINGS     100480508  // Ratings in entire training set (+1)
  #define MAX_CUSTOMERS   480190     // Customers in entire training set (+1)
  #define MAX_MOVIES      17771      // Movies in entire training set (+1)
  #define MAX_FEATURES    64         // Number of features to use
  #define MIN_EPOCHS      120        // Min number of epochs per feature
  #define MAX_EPOCHS      200        // Max epochs per feature
  #define MIN_IMPROVEMENT 0.0001     // Min improvement required to continue current feature
  #define INIT            0.1        // Initialization value for features
  #define LRATE           0.001      // Learning rate parameter
  #define K               0.015      // Regularization parameter to minimize over-fitting

  typedef unsigned char BYTE;
  typedef map<int, int> IdMap;
  typedef IdMap::iterator IdItr;

  struct Movie {
      int RatingCount;
      int RatingSum;
      double RatingAvg;
      double PseudoAvg;   // Weighted average to deal with small movie counts
  };

  struct Customer {
      int CustomerId;
      int RatingCount;
      int RatingSum;
  };

  struct Data {
      int CustId;
      short MovieId;
      BYTE Rating;
      float Cache;
  };

  class Engine {
  private:
      int      m_nRatingCount;                               // Current number of loaded ratings
      Data     m_aRatings[MAX_RATINGS];                      // Array of ratings data
      Movie    m_aMovies[MAX_MOVIES];                        // Array of movie metrics
      Customer m_aCustomers[MAX_CUSTOMERS];                  // Array of customer metrics
      float    m_aMovieFeatures[MAX_FEATURES][MAX_MOVIES];   // Array of features by movie
      float    m_aCustFeatures[MAX_FEATURES][MAX_CUSTOMERS]; // Array of features by customer
      IdMap    m_mCustIds;    // Map for one-time translation of ids to compact array index

      inline double PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing=true);
      inline double PredictRating(short movieId, int custId);

      bool ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut);
      bool ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue);
      bool ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue);

  public:
      Engine(void);
      ~Engine(void) { };
      void CalcMetrics();
      void CalcFeatures();
      void LoadHistory();
      void ProcessTest(wchar_t* pwzFile);
      void ProcessFile(wchar_t* pwzFile);
  };

  //=============
  // Program Main
  int _tmain(int argc, _TCHAR* argv[]) {
      Engine* engine = new Engine();
      engine->LoadHistory();
      engine->CalcMetrics();
      engine->CalcFeatures();
      engine->ProcessTest(L"qualifying.txt");
      wprintf(L"\nDone\n");
      getchar();
      return 0;
  }

  5. //=====================================
  // Engine Class
  //=====================================

  // Initialization
  Engine::Engine(void) {
      m_nRatingCount = 0;
      for (int f=0; f<MAX_FEATURES; f++) {
          for (int i=0; i<MAX_MOVIES; i++)    m_aMovieFeatures[f][i] = (float)INIT;
          for (int i=0; i<MAX_CUSTOMERS; i++) m_aCustFeatures[f][i]  = (float)INIT;
      }
  }

  //------------------------------------------
  // Calculations - This paragraph contains all of the relevant code
  //------------------------------------------

  // CalcMetrics
  // - Loop through the history and pre-calculate metrics used in the training
  // - Also re-number the customer ids to fit in a fixed array
  void Engine::CalcMetrics() {
      int i, cid;
      IdItr itr;

      wprintf(L"\nCalculating intermediate metrics\n");

      // Process each row in the training set
      for (i=0; i<m_nRatingCount; i++) {
          Data* rating = m_aRatings + i;

          // Increment movie stats
          m_aMovies[rating->MovieId].RatingCount++;
          m_aMovies[rating->MovieId].RatingSum += rating->Rating;

          // Add customers (using a map to re-number ids to array indexes)
          itr = m_mCustIds.find(rating->CustId);
          if (itr == m_mCustIds.end()) {
              cid = 1 + (int)m_mCustIds.size();

              // Reserve new id and add lookup
              m_mCustIds[rating->CustId] = cid;

              // Store off old sparse id for later
              m_aCustomers[cid].CustomerId = rating->CustId;

              // Init vars to zero
              m_aCustomers[cid].RatingCount = 0;
              m_aCustomers[cid].RatingSum = 0;
          } else {
              cid = itr->second;
          }

          // Swap sparse id for compact one
          rating->CustId = cid;

          m_aCustomers[cid].RatingCount++;
          m_aCustomers[cid].RatingSum += rating->Rating;
      }

      // Do a follow-up loop to calc movie averages
      for (i=0; i<MAX_MOVIES; i++) {
          Movie* movie = m_aMovies + i;
          movie->RatingAvg = movie->RatingSum / (1.0 * movie->RatingCount);
          movie->PseudoAvg = (3.23 * 25 + movie->RatingSum) / (25.0 + movie->RatingCount);
      }
  }

  // CalcFeatures
  // - Iteratively train each feature on the entire data set
  // - Once sufficient progress has been made, move on
  void Engine::CalcFeatures() {
      int f, e, i, custId, cnt = 0;
      Data* rating;
      double err, p, sq, rmse_last, rmse = 2.0;
      short movieId;
      float cf, mf;

      for (f=0; f<MAX_FEATURES; f++) {
          wprintf(L"\n--- Calculating feature: %d ---\n", f);

          // Keep looping until you have passed a minimum number
          // of epochs or have stopped making significant progress
          for (e=0; (e < MIN_EPOCHS) || (rmse <= rmse_last - MIN_IMPROVEMENT); e++) {
              cnt++;
              sq = 0;
              rmse_last = rmse;

              for (i=0; i<m_nRatingCount; i++) {
                  rating  = m_aRatings + i;
                  movieId = rating->MovieId;
                  custId  = rating->CustId;

                  // Predict rating and calc error
                  p = PredictRating(movieId, custId, f, rating->Cache, true);
                  err = (1.0 * rating->Rating - p);
                  sq += err*err;

                  // Cache off old feature values
                  cf = m_aCustFeatures[f][custId];
                  mf = m_aMovieFeatures[f][movieId];

                  // Cross-train the features
                  m_aCustFeatures[f][custId]   += (float)(LRATE * (err * mf - K * cf));
                  m_aMovieFeatures[f][movieId] += (float)(LRATE * (err * cf - K * mf));
              }
              rmse = sqrt(sq/m_nRatingCount);
              wprintf(L"<set x='%d' y='%f' />\n", cnt, rmse);
          }

  6.      // Cache off old predictions
          for (i=0; i<m_nRatingCount; i++) {
              rating = m_aRatings + i;
              rating->Cache = (float)PredictRating(rating->MovieId, rating->CustId, f, rating->Cache, false);
          }
      }
  }

  // PredictRating
  // - During training there is no need to loop through all of the features
  // - Use a cache for the leading features and do a quick calculation for the trailing
  // - The trailing can be optionally removed when calculating a new cache value
  double Engine::PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing) {
      // Get cached value for old features or default to an average
      double sum = (cache > 0) ? cache : 1; //m_aMovies[movieId].PseudoAvg;

      // Add contribution of current feature
      sum += m_aMovieFeatures[feature][movieId] * m_aCustFeatures[feature][custId];
      if (sum > 5) sum = 5;
      if (sum < 1) sum = 1;

      // Add up trailing default values
      if (bTrailing) {
          sum += (MAX_FEATURES-feature-1) * (INIT * INIT);
          if (sum > 5) sum = 5;
          if (sum < 1) sum = 1;
      }
      return sum;
  }

  // PredictRating
  // - This version is used for calculating the final results
  // - It loops through the entire list of finished features
  double Engine::PredictRating(short movieId, int custId) {
      double sum = 1; //m_aMovies[movieId].PseudoAvg;
      for (int f=0; f<MAX_FEATURES; f++) {
          sum += m_aMovieFeatures[f][movieId] * m_aCustFeatures[f][custId];
          if (sum > 5) sum = 5;
          if (sum < 1) sum = 1;
      }
      return sum;
  }

  //=====================================
  // Data Loading / Saving
  //=====================================

  // LoadHistory
  // - Loop through all of the files in the training directory
  void Engine::LoadHistory() {
      WIN32_FIND_DATA FindFileData;
      HANDLE hFind;
      bool bContinue = true;
      int count = 0; // TEST

      // Loop through all of the files in the training directory
      hFind = FindFirstFile(TRAINING_PATH, &FindFileData);
      if (hFind == INVALID_HANDLE_VALUE) return;

      while (bContinue) {
          this->ProcessFile(FindFileData.cFileName);
          bContinue = (FindNextFile(hFind, &FindFileData) != 0);
          //if (++count > 999) break; // TEST: Uncomment to only test with the first X movies
      }
      FindClose(hFind);
  }

  // ProcessFile
  // - Load a history file in the format: <MovieId>: <CustomerId>,<Rating> <CustomerId>,<Rating> ...
  void Engine::ProcessFile(wchar_t* pwzFile) {
      FILE *stream;
      wchar_t pwzBuffer[1000];
      wsprintf(pwzBuffer, TRAINING_FILE, pwzFile);
      int custId, movieId, rating, pos = 0;

      wprintf(L"Processing file: %s\n", pwzBuffer);
      if (_wfopen_s(&stream, pwzBuffer, L"r") != 0) return;

      // First line is the movie id
      fgetws(pwzBuffer, 1000, stream);
      ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);
      m_aMovies[movieId].RatingCount = 0;
      m_aMovies[movieId].RatingSum = 0;

      // Get all remaining rows
      fgetws(pwzBuffer, 1000, stream);
      while ( !feof( stream ) ) {
          pos = 0;
          ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
          ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, rating);
          m_aRatings[m_nRatingCount].MovieId = (short)movieId;
          m_aRatings[m_nRatingCount].CustId  = custId;
          m_aRatings[m_nRatingCount].Rating  = (BYTE)rating;
          m_aRatings[m_nRatingCount].Cache   = 0;
          m_nRatingCount++;
          fgetws(pwzBuffer, 1000, stream);
      }

      // Cleanup
      fclose( stream );
  }

  // ProcessTest
  // - Load a sample set in the following format:
  //   <Movie1Id>: <CustomerId> <CustomerId> ... <Movie2Id>: <CustomerId> ...
  // - And write results: <Movie1Id>: <Rating> <Rating> ...
  void Engine::ProcessTest(wchar_t* pwzFile) {
      FILE *streamIn, *streamOut;
      wchar_t pwzBuffer[1000];
      int custId, movieId, pos = 0;
      double rating;
      bool bMovieRow;

      wsprintf(pwzBuffer, TEST_PATH, pwzFile);
      wprintf(L"Processing test: %s\n", pwzBuffer);

  7.  if (_wfopen_s(&streamIn, pwzBuffer, L"r") != 0) return;
      if (_wfopen_s(&streamOut, PREDICTION_FILE, L"w") != 0) return;

      fgetws(pwzBuffer, 1000, streamIn);
      while ( !feof( streamIn ) ) {
          bMovieRow = false;
          for (int i=0; i<(int)wcslen(pwzBuffer); i++) {
              bMovieRow |= (pwzBuffer[i] == 58);   // 58 is the ':' character
          }

          pos = 0;
          if (bMovieRow) {
              ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);

              // Write same row to results
              fputws(pwzBuffer, streamOut);
          } else {
              ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
              custId = m_mCustIds[custId];
              rating = PredictRating(movieId, custId);

              // Write predicted value
              swprintf(pwzBuffer, 1000, L"%5.3f\n", rating);
              fputws(pwzBuffer, streamOut);
          }
          //wprintf(L"Got Line: %d %d %d\n", movieId, custId, rating);
          fgetws(pwzBuffer, 1000, streamIn);
      }

      // Cleanup
      fclose( streamIn );
      fclose( streamOut );
  }

  //----------------------------------------
  // Helper Functions
  //----------------------------------------
  bool Engine::ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut) {
      int count = 0;
      int start = nPosition;
      wchar_t wc = 0;

      // Find start of number (digit or '-')
      while (start < nLength) {
          wc = pwzBufferIn[start];
          if ((wc >= 48 && wc <= 57) || (wc == 45)) break;
          start++;
      }

      // Copy each character into the output buffer (digits, 'E', 'e', '-', '.')
      nPosition = start;
      while (nPosition < nLength &&
             ((wc >= 48 && wc <= 57) || wc == 69 || wc == 101 || wc == 45 || wc == 46)) {
          pwzBufferOut[count++] = wc;
          wc = pwzBufferIn[++nPosition];
      }

      // Null terminate and return
      pwzBufferOut[count] = 0;
      return (count > 0);
  }

  bool Engine::ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue) {
      wchar_t pwzNumber[20];
      bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
      fValue = (bResult) ? (float)_wtof(pwzNumber) : 0;
      return bResult;
  }

  bool Engine::ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue) {
      wchar_t pwzNumber[20];
      bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
      nValue = (bResult) ? _wtoi(pwzNumber) : 0;
      return bResult;
  }

  8. How do we use this theory? For dot-product-gap based clustering, we can hill-climb a_kk below to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

  Maximizing the variance: given any table X(X_1, ..., X_n) and any unit vector d in n-space, let X∘d = F_d(X) = DPP_d(X) be the N-vector of dot-product projections, x_i∘d = Σ_{j=1..n} x_{i,j} d_j. Then

      V(d) ≡ Variance(X∘d) = mean((X∘d)²) − (mean(X∘d))²
           = Σ_{j=1..n} (mean(x_j²) − mean(x_j)²) d_j² + 2 Σ_{j<k} (mean(x_j x_k) − mean(x_j) mean(x_k)) d_j d_k
           = Σ_j a_jj d_j² + Σ_{j≠k} a_jk d_j d_k = dᵀ ∘ A ∘ d,   subject to Σ_{i=1..n} d_i² = 1,

  where a_jk ≡ mean(x_j x_k) − mean(x_j) mean(x_k). The gradient is ∇V(d) = 2 A∘d, so given any d_0 one can hill-climb to locally maximize the variance V: d_1 ≡ unitize(∇V(d_0)); d_2 ≡ unitize(∇V(d_1)); ... until V(d_k) stops improving.

  Ubhaya Theorem 1: ∃ k ∈ {1, ..., n} s.t. d = e_k will hill-climb V to its global maximum. Theorem 2 (working on it): let d = e_k s.t. a_kk is a maximal diagonal element of A; then d = e_k will hill-climb V to its global maximum.

  For dot-product-gap based classification, we can start with X = the table of the C training-set class means, where M_k ≡ MeanVectorOfClass_k. These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.

  FAUST Classifier MVDI (Maximized Variance Definite Indefinite): build a decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Apply DI (definite/indefinite interval splitting) each round.

  FAUST technology relies on: 1. a distance-dominating functional, F; 2. use of gaps in range(F) to separate. For unsupervised learning (clustering): hierarchical divisive? piecewise linear? other? Performance analysis: which approach is best for which type of table? For supervised learning (classification): decision tree? nearest neighbor? piecewise linear? Performance analysis: which is best for a given training set?

  White papers: Terabyte Head Wall; The Only Good Data is Data in Motion; Multilevel pTrees: k=0,1 suffices! A PTreeSet is defined by specifying a table, an array of stride_lengths (usually equi-length, so just that one length is specified) and a stride_predicate (a T/F condition on a stride, where a stride is a bag [or array?] of bits). So the metadata of PTreeSet(T,sl,sp) specifies T, sl and sp. A "raw" PTreeSet has sl=1 and the identity predicate (sl and sp not used). A "cooked" PTreeSet (AKA Level-1 PTreeSet) for a table has sl>1 (main purpose: provide compact summary information on the table). Let PTS(T) be a raw PTreeSet; then it, plus PTS(T,64,p), ..., PTS(T,64^k,p), form a tree of vertical summarizations of T. Note that P(T, 64*64, p) is different from P(P(T,64,p), 64, p), but both make sense, since P(T, 64, p) is a table and P(P(T, 64, p), 64, p) is just a cooked pTree on it.
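  A sketch of this variance hill-climb (all names are mine; the start d = e_k with a_kk maximal follows Theorem 2, and the iteration d ← unitize(2A∘d) is the gradient hill-climb above, which for a covariance matrix converges to the top eigenvector of A, the global maximizer of V on the unit sphere):

      #include <cmath>
      #include <cstdio>
      #include <vector>

      using Vec = std::vector<double>;
      using Mat = std::vector<Vec>;

      // a_jk = mean(x_j x_k) - mean(x_j) mean(x_k): the covariance matrix of X.
      Mat covariance(const Mat& X) {
          size_t N = X.size(), n = X[0].size();
          Vec mean(n, 0.0);
          for (const Vec& x : X)
              for (size_t j = 0; j < n; j++) mean[j] += x[j] / N;
          Mat A(n, Vec(n, 0.0));
          for (const Vec& x : X)
              for (size_t j = 0; j < n; j++)
                  for (size_t k = 0; k < n; k++) A[j][k] += x[j] * x[k] / N;
          for (size_t j = 0; j < n; j++)
              for (size_t k = 0; k < n; k++) A[j][k] -= mean[j] * mean[k];
          return A;
      }

      // Hill-climb V(d) = d'Ad on the unit sphere via d <- unitize(grad V) = unitize(2Ad).
      Vec hillClimb(const Mat& A, Vec d, int iters = 50) {
          size_t n = A.size();
          for (int t = 0; t < iters; t++) {
              Vec g(n, 0.0);
              for (size_t j = 0; j < n; j++)
                  for (size_t k = 0; k < n; k++) g[j] += 2.0 * A[j][k] * d[k];
              double norm = 0;
              for (double v : g) norm += v * v;
              norm = std::sqrt(norm);
              for (size_t j = 0; j < n; j++) d[j] = g[j] / norm;
          }
          return d;
      }

      int main() {
          Mat X = {{1, 2}, {2, 3}, {3, 5}, {4, 7}};   // toy table
          Mat A = covariance(X);
          Vec d(A.size(), 0.0);                       // Theorem 2's start:
          size_t k = 0;                               // d = e_k, a_kk maximal
          for (size_t j = 1; j < A.size(); j++) if (A[j][j] > A[k][k]) k = j;
          d[k] = 1.0;
          d = hillClimb(A, d);
          std::printf("d = (%f, %f)\n", d[0], d[1]);
          return 0;
      }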

  9. FAUST MVDI on IRIS: 15 records from each class were held out for testing (Virg39 was removed as an outlier).

  Round 1, d0 = (.33, -.1, .86, .38):

      s-Mean 50.49 34.74 14.74  2.43   s-range [-1, 10)
      e-Mean 63.50 30.00 44.00 13.50   e-range [23, 48)    s_ei = [23, 10) empty
      i-Mean 61.00 31.50 55.50 21.50   i-range [38, 70)    indefinite se_i = [38, 48]

  Cuts: x∘d0 < 16.5 (= avg{23, 10}) → Setosa (Ct=50); 16.5 ≤ x∘d0 < 38 → Versicolor (Ct=24); 38 ≤ x∘d0 ≤ 48 → indefinite (seCt=26, iCt=13); 48 < x∘d0 → Virginica (Ct=39).

  Round 2 on the indefinite set, d1 = (-.55, -.33, .51, .57):

      i-Mean 62.8 29.2 46.1 14.5   i-range [-1, 8)
      e-Mean 59.0 26.9 49.6 18.4   e-range [10, 17)    indefinite e_i = [8, 10] (eCt=5, iCt=4)

  In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in the decision tree: x∘d0 < 16.5 → Setosa; 16.5 ≤ x∘d0 < 38 → Versicolor; 38 ≤ x∘d0 ≤ 48 → descend (x∘d1 < 9 → Virginica, x∘d1 ≥ 9 → Versicolor); 48 < x∘d0 → Virginica. [The slide shows this as a two-level decision tree diagram with per-interval counts.]

  10. FAUST MVDI on SatLog: 413 training samples, 4 attributes, 6 classes, 127 test samples.

  Using the class means to build A gives noticeably looser F-intervals than using the full data (much better!). The root direction is found by a gradient hill-climb of Variance(d):

      d1    d2    d3    d4    V(d)
      0.00  0.00  1.00  0.00   282
      0.13  0.38  0.64  0.65   700
      0.20  0.51  0.62  0.57   742
      0.26  0.62  0.57  0.47   781
      0.30  0.70  0.53  0.38   810
      0.34  0.76  0.48  0.30   830
      0.36  0.79  0.44  0.23   841
      0.37  0.81  0.40  0.18   847
      0.38  0.83  0.38  0.15   850
      0.39  0.84  0.36  0.12   852
      0.39  0.84  0.35  0.10   853

  Successive tree nodes hill-climb their own d on their remaining (indefinite) samples, e.g., d = (-.11, -.22, .54, .81), (-.15, -.29, .56, .76), (-.81, .17, .45, .33), (-.01, -.19, .7, .69), (-.66, .19, .47, .56); the result is the same using class means or the training subset. Where a node's hill-climbs are inconclusive both ways, predict the plurality class.

  On the 127-sample SatLog test set: 4 errors, or 96.8% accuracy. Speed? With horizontal data, the decision tree is applied one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire test set (a PTreeSet), create the various dot-product SPTSs (one for each inode) and create cut-SPTS masks; these masks mask the results for the entire test set at once.

  For WINE: awful results!

  [The rest of the slide is residue of the per-node class-mean/min/max tables, per-interval class listings and the hill-climb traces at each node.]

  11. FAUST MVDI on Concrete (classes l/m/h) and Seeds.

  The Concrete decision tree uses successive directions d0 = (-.34, -.16, .81, -.45), d1 = (.85, -.03, .52, -.02), d2 = (.85, -.00, .53, .05), d3 = (.81, .04, .58, .01), d4 = (.79, .14, .60, .03), with cuts such as x∘d0 < 320, x∘d0 ≥ 634, x∘d2 < 28, x∘d2 ≥ 662, x∘d3 < 969, x∘d3 ≥ 868, x∘d4 < 640, x∘d4 ≥ 681; each branch terminates in a definite class (l, m or h) with its training/test error counts.

  Concrete: 7 test errors out of 30 (77% accuracy). Seeds: 8 test errors out of 32 (75% accuracy).

  [The rest of the slide is the tree diagram with per-node class min/max tables and error counts.]

  12. FAUST Classifier: cut-placement variants. Let R and V be the two classes, with means mR and mV, D ≡ mV − mR and d = D/|D|.

  0. Cut in the middle of the means: a = (mR + (mV − mR)/2)∘d = ((mR + mV)/2)∘d, and let P_R = P_{x∘d < a}, P_V = P_{x∘d ≥ a}.
  1. Cut in the middle of the VectorOfMedians (VOM), not the means. (Use the stdev ratio, not the middle, for even better cut placement?)
  2. Cut in the middle of {Max{R∘d}, Min{V∘d}} (assuming mR∘d ≤ mV∘d). If there is no gap, move the cut to minimize Rerrors + Verrors.
  3. Hill-climb d to maximize the gap, or to minimize training-set errors, or (simplest) to minimize dis(max{r∘d}, min{v∘d}).
  4. Replace mR, mV with the average of the margin points?
  5. P_R = P_{x∘d < CutR}, P_V = P_{x∘d > CutV}. If Min{V∘d} ≥ Max{R∘d} (a gap), set CutR = CutV = avg{Min{V∘d}, Max{R∘d}}; else CutR ≡ Min{V∘d} and CutV ≡ Max{R∘d}. Then y ∈ P_R or y ∈ P_V are definite classifications; otherwise re-do on the indefinite region, P_{CutR ≤ x∘d ≤ CutV}, until an actual gap appears (AND with a certain stopping condition, e.g., "on the nth round, use definite only (cut at midpoint(mR, mV))").

  Another way to view FAUST DI is as a decision-tree method: with each non-empty indefinite set, descend down the tree to a new level; for each definite set, terminate the descent and make the classification. Each round, it may be advisable to go through an outlier-removal process on each class before setting Min{V∘d} and Max{R∘d} (e.g., iteratively check whether F⁻¹(Min{V∘d}) consists of V-outliers).

  [The slide includes a 2-D scatter diagram of the r and v points with vomR, vomV, mR, mV, the d-line and d2-line, Min{V∘d}, Max{R∘d} and the cut a.]
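  A minimal sketch of variant 0, the mean-midpoint cut (all names are mine):

      #include <vector>

      using Vec = std::vector<double>;

      double dot(const Vec& a, const Vec& b) {
          double s = 0;
          for (size_t j = 0; j < a.size(); j++) s += a[j] * b[j];
          return s;
      }

      Vec mean(const std::vector<Vec>& X) {
          Vec m(X[0].size(), 0.0);
          for (const Vec& x : X)
              for (size_t j = 0; j < m.size(); j++) m[j] += x[j] / X.size();
          return m;
      }

      // Two-class FAUST cut, variant 0: project onto D = mV - mR (leaving D
      // unnormalized merely rescales the cut) and cut at the midpoint of the
      // two projected means, a = ((mR + mV)/2) o d.
      struct MeanMidpointCut {
          Vec d; double a;
          MeanMidpointCut(const std::vector<Vec>& R, const std::vector<Vec>& V) {
              Vec mR = mean(R), mV = mean(V);
              d.resize(mR.size());
              for (size_t j = 0; j < d.size(); j++) d[j] = mV[j] - mR[j];
              a = (dot(mR, d) + dot(mV, d)) / 2;
          }
          bool isV(const Vec& x) const { return dot(x, d) >= a; }  // else class R
      };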

  13. FAUST DI. Given a K-class training set, TK, and a given d (e.g., from D ≡ Mean(TK)→Median(TK)): let m_i ≡ mean(C_i), with classes ordered so that d∘m_1 ≤ d∘m_2 ≤ ... ≤ d∘m_K. Let Mn_i ≡ Min{d∘C_i}, Mx_i ≡ Max{d∘C_i}, Mn_{>i} ≡ Min_{j>i}{Mn_j} and Mx_{<i} ≡ Max_{j<i}{Mx_j}. Then

      Definite_i = ( Mx_{<i}, Mn_{>i} )      Indefinite_{i,i+1} = [ Mn_{>i}, Mx_{<i+1} ]

  Then recurse on each indefinite set.

  For IRIS, 15 records were extracted from each class for testing; the rest are the training set, TK. With D = Mean_s→Mean_e:

      s-Mean 50.49 34.74 14.74  2.43   s-range [-1, 25)
      e-Mean 63.50 30.00 44.00 13.50   e-range [10, 37)    se = [25, 10) empty
      i-Mean 61.00 31.50 55.50 21.50   i-range [48, 128)   indefinite ei = [37, 48]

  1st round (D = Mean_s→Mean_e): F < 18 → setosa (35 seto); 18 < F < 37 → versicolor (15 vers); 37 ≤ F ≤ 48 → IndefiniteSet2 (20 vers, 10 virg); 48 < F → virginica (25 virg).
  IndefSet2 round (D = Mean_e→Mean_i): F < 7 → versicolor (17 vers, 0 virg); 7 ≤ F ≤ 10 → IndefSet3 (3 vers, 5 virg); 10 < F → virginica (0 vers, 5 virg).
  IndefSet3 round (D = Mean_e→Mean_i): F < 3 → versicolor (2 vers, 0 virg); 3 ≤ F ≤ 7 → IndefSet4 (2 vers, 1 virg), where we assign 0 ≤ F ≤ 7 → versicolor; 7 < F → virginica (0 vers, 3 virg).

  Test: F < 15 → setosa (15 seto); 15 ≤ F ≤ 41 → IndefiniteSet2 (15 vers, 1 virg); 41 < F → virginica (14 virg). IndefSet2 round: F < 20 → versicolor (15 vers, 0 virg); 20 < F → virginica (0 vers, 1 virg). 100% accuracy.

  Options for the sequence of D's:
  Option-1: D_k = Mean(Class_k)→Mean(Class_{k+1}), k = 1, ... (and Mean could be replaced by VOM or?).
  Option-2: D_k = Mean(Class_k)→Mean(∪_{h=k+1..n} Class_h), where k is the class with max count in the subcluster (VOM?).
  Option-3: D = Mean(Class_k)→Mean(∪_{h not used yet} Class_h), where k is the class with max count in the subcluster (VOM instead?).
  Option-4: always pick the means pair which are furthest separated from each other.
  Option-5: start with Median→Mean of the IndefiniteSet, then the means pair corresponding to max separation of F(mean_i), F(mean_j).
  Option-6: always use Median→Mean of the IndefiniteSet, IS (initially, IS = X).
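  A sketch of the Definite/Indefinite interval computation, assuming the per-class projection values d∘C_i have already been computed and the classes ordered by d∘m_i (names are mine):

      #include <algorithm>
      #include <limits>
      #include <utility>
      #include <vector>

      struct Intervals {
          std::vector<std::pair<double, double>> definite;    // per class: (Mx_{<i}, Mn_{>i})
          std::vector<std::pair<double, double>> indefinite;  // per adjacent pair: [Mn_{>i}, Mx_{<i+1}]
      };

      Intervals faustDI(const std::vector<std::vector<double>>& proj) {
          int K = (int)proj.size();
          std::vector<double> Mn(K), Mx(K);
          for (int i = 0; i < K; i++) {
              Mn[i] = *std::min_element(proj[i].begin(), proj[i].end());
              Mx[i] = *std::max_element(proj[i].begin(), proj[i].end());
          }
          const double inf = std::numeric_limits<double>::infinity();
          Intervals out;
          for (int i = 0; i < K; i++) {
              double mxBelow = -inf, mnAbove = inf;            // Mx_{<i}, Mn_{>i}
              for (int j = 0; j < i; j++)     mxBelow = std::max(mxBelow, Mx[j]);
              for (int j = i + 1; j < K; j++) mnAbove = std::min(mnAbove, Mn[j]);
              out.definite.push_back({mxBelow, mnAbove});
              if (i + 1 < K) {
                  double mxUpTo = -inf;                        // Mx_{<i+1}
                  for (int j = 0; j <= i; j++) mxUpTo = std::max(mxUpTo, Mx[j]);
                  out.indefinite.push_back({mnAbove, mxUpTo}); // recurse on this set
              }
          }
          return out;
      }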

  14. FAUST DI, sequential. For SEEDS, 15 records were extracted from each class for testing.

  Option-6 (D = Median→Mean of the IndefiniteSet; initially IS = X), on the whole training set:

      m1 14.4 5.6 2.7 5.1   meanF1 = 37.3      def3 [-inf, 21):  0 / 0 / 32   (class 1 / 2 / 3 counts)
      m2 18.6 6.2 3.7 6.0   meanF2 = 71.2      ind1 [21, 28)
      m3 11.8 5.0 4.7 5.0   meanF3 = 12.0      def1 [28, 49):   22 / 0 / 0
                                               ind2 [49, 58)
                                               def2 [58, inf):   0 / 30 / 0

  Option-4 (the means pair most separated in X): with d(m1,m2) = 4.4, d(m1,m3) = 3.4 and d(m2,m3) = 7.0, the chosen D leaves all three classes overlapping in 0 ≤ F ≤ 106, so it is totally non-productive!

  Recursing on the first indefinite set (Indef-1, then Indef-11, Indef-111, Indef-1111) repeatedly peels off single outliers -- a class-3 outlier at F=0, a class-1 outlier at F=29, another class-1 outlier at F=54 -- until the remainder is declared Class=1.

  [The rest of the slide is the per-round mean/interval tables for Indef-1 through Indef-1111.]

  15. FAUST DI, sequential (SEEDS, continued; 15 records per class held out for testing).

  Option-6 with D = Median→Mean of X on the whole training set gives, as before: [-inf, 21) → class 3 (32); [28, 49) → class 1 (22); [58, inf) → class 2 (30); with indefinite sets ind31 = [21, 28) and ind12 = [49, 58).

  Each indefinite set is then re-projected with D = Mean(loF)→Mean(hiF) of that set: ind12 (d = (-.9, -.1, .14, -.1)), ind31 (d = (.9, -.1, -.2, -.2)), then ind1313, ind31313, ind313131, ... (after which d repeats, so the remainder is declared C1). Each round peels off definite intervals such as def1 [-inf, 2) and def2 [15, inf), or def3 [-inf, 10) and def1 [20, inf), leaving ever-smaller indefinite sets; the rest is declared Class=1.

  [The rest of the slide is the per-round mean/interval/count tables for these indefinite sets.]

  16. Finding a good unit vector d for the dot-product functional DPP, to maximize gaps. As on slide 8, X∘d = F_d(X) = DPP_d(X), and V(d) ≡ Var(DPP_d(X)) = dᵀ ∘ V_X ∘ d subject to Σ_i d_i² = 1, with gradient ∇V = 2A∘d; the hill-climb starts at d_0 = e_k with a_kk maximal (or d_0 with components d_{0,k} = a_kk) and iterates d_1 ≡ unitize(∇V(d_0)), d_2 ≡ unitize(∇V(d_1)), ... until V(d_k) converges.

  An alternative criterion: maximize |Mean(DPP_d(X)) − Median(DPP_d(X))| with respect to d. The mean is easy:

      Mean(DPP_d(X)) = (1/N) Σ_{i=1..N} Σ_{j=1..n} x_{i,j} d_j = Σ_{j=1..n} mean(x_j) d_j.

  Can we compute Median(DPP_d(X)) using only pTree processing? We want a formula in d and numbers only, like the one above for the mean (which involves only the vector d and the numbers mean(x_1), ..., mean(x_n)).

  FAUST CLUSTERING: use DPP_d(x), but which unit vector d* provides the best gap(s)? 1. Exhaustively search a grid of d's for the best gap provider. 2. Use some heuristic to choose a good d: GV (gradient-optimized variance); MM (use the d that maximizes |Median(F(X)) − Mean(F(X))|; we have the mean as a function of d -- can you do the median?); HMM (use a heuristic for Median(F(X)): F(VectorOfMedians) = VOM∘d); MVM (use D = Mean(X)→VOM(X), d = D/|D|). Maximizing the variance -- is it wise?

  [The slide ends with a worked numeric comparison: eight candidate projection sequences with their std, variance, mean, consecutive differences, avgCD, maxCD and |mean − VOM|. The MEDIAN criterion picks out the last 2 sequences, which have the best gaps (discounting outlier gaps at the extremes), and discards sequences 1, 3 and 4, which are not so good.]

  17. FAUST clustering, a simple example: G_d(x) = x∘d and F_{p,d}(x) = G_d(x) − min(G), on a dataset of 15 image points z1, z2, ..., zf. The Level-0, stride=z1 PointSet is stored as a pTree mask over z1 z2 z3 z4 z5 z6 z7 z8 z9 za zb zc zd ze zf. [The slide plots the 15 points on an x1/x2 grid, with p and q = z1 marked.]

  The 15 value arrays (one for each q = z1, z2, z3, ...):

      z1  0 1 2 5 6 10 11 12 14
      z2  0 1 2 5 6 10 11 12 14
      z3  0 1 2 5 6 10 11 12 14
      z4  0 1 3 6 10 11 12 14
      z5  0 1 2 3 5 6 10 11 12 14
      z6  0 1 2 3 7 8 9 10
      z7  0 1 2 3 4 6 9 11 12
      z8  0 1 2 3 4 6 9 11 12
      z9  0 1 2 3 4 6 7 10 12 13
      za  0 1 2 3 4 5 7 11 12 13
      zb  0 1 2 3 4 6 8 10 11 12
      zc  0 1 2 3 5 6 7 8 9 11 12 13
      zd  0 1 2 3 7 8 9 10
      ze  0 1 2 3 5 7 9 11 12 13
      zf  0 1 3 5 6 7 8 9 10 11

  The 15 count arrays:

      z1  2 2 4 1 1 1 1 2 1
      z2  2 2 4 1 1 1 1 2 1
      z3  1 5 2 1 1 1 1 2 1
      z4  2 4 2 2 1 1 2 1
      z5  2 2 3 1 1 1 1 1 2 1
      z6  2 1 1 1 1 3 3 3
      z7  1 4 1 3 1 1 1 2 1
      z8  1 2 3 1 3 1 1 2 1
      z9  2 1 1 2 1 3 1 1 2 1
      za  2 1 1 1 1 1 4 1 1 2
      zb  1 2 1 1 3 2 1 1 1 2
      zc  1 1 1 2 2 1 1 1 1 1 1 2
      zd  3 3 3 1 1 1 1 2
      ze  1 1 2 1 3 2 1 1 2 1
      zf  1 2 1 1 2 1 2 2 2 1

  For F with p = MN and q = z1, the z1 value array shows two gaps, [F=2, F=5] and [F=6, F=10], which split the 15 points into 3 clusters. The pTree masks of the 3 z1-clusters (obtained by ORing the point masks):

      z11  0 0 0 0 0 0 1 1 1 1 1 1 1 1 0
      z12  0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
      z13  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
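  The gap-splitting step itself, sketched on plain projected values (the pTree machinery is elided; with a gap threshold of 3 this reproduces the three z1-clusters above):

      #include <algorithm>
      #include <vector>

      // Split projected values F(x) into clusters wherever consecutive sorted
      // values are separated by a gap of at least minGap. E.g., z1's values
      // 0 1 2 | 5 6 | 10 11 12 14 split into three clusters with minGap = 3.
      std::vector<std::vector<double>> gapClusters(std::vector<double> F, double minGap) {
          if (F.empty()) return {};
          std::sort(F.begin(), F.end());
          std::vector<std::vector<double>> clusters(1);
          clusters[0].push_back(F[0]);
          for (size_t i = 1; i < F.size(); i++) {
              if (F[i] - F[i - 1] >= minGap) clusters.push_back({});
              clusters.back().push_back(F[i]);
          }
          return clusters;
      }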

  18. What have we learned? What is the DPP_d FAUST CLUSTER algorithm? D = Median→Mean (where Median means the Vector of Medians), d1 ≡ D/|D|, is a good start -- but first, variance-gradient hill-climb it. For X2 = SubCluster2, use a d2 which is perpendicular to d1? In high dimensions, there are many perpendicular directions. GV hill-climb d2 = D2/|D2| (D2 = Median(X2) − Mean(X2)) constrained to be perpendicular to d1, i.e., constrained to d2∘d1 = 0 (in addition to d2∘d2 = 1). We may not want to constrain this second hill-climb to unit vectors perpendicular to d1: it might be the case that the gap gets wider using a d2 which is not perpendicular to d1.

  GMP: gradient hill-climb (w.r.t. d) of Variance(DPP_d) starting at d2 = D2/|D2|, where d2 ≡ unitize( VOM{x − (x∘d1)d1 : x ∈ X2} − Mean{x − (x∘d1)d1 : x ∈ X2} ), i.e., the residuals after removing the d1 component, hill-climbed subject only to d∘d = 1. (We shouldn't constrain the 2nd hill-climb to d1∘d2 = 0, nor subsequent hill-climbs to d_k∘d_h = 0 for h = 2..k−1: the gap could be larger. So the 2nd round starts at this d2 and hill-climbs subject only to d∘d = 1.)

  GCCP: gradient hill-climb (w.r.t. d) of Variance(DPP_d) starting at d2 = D2/|D2|, where D2 = CC_i(X2) − CC_j(X2), and hill-climb subject to d∘d = 1, where the CCs are two of the circumscribing rectangle's corners (the CCs may be faster calculations than Mean and VOM). Taking all edges and diagonals of CCR(X) (the coordinate-wise circumscribing rectangle of X) provides a grid of unit vectors. It is an equi-spaced grid iff we use a CCC(X) (coordinate-wise circumscribing cube of X). Note that there may be many CCC(X)s; a canonical one is the one that is furthest from the origin (take the longest side first, then extend each other side the same distance from the origin side of that edge). A good choice may be to always take the longest side of CR(X) as D, D ≡ LSCR(X). Should outliers on the (n−1)-dim faces at the ends of LSCR(X) be removed first? If so, remove LSCR(X)-endface outliers until, after removal, the same side is still the LSCR(X); then use that LSCR(X) as D.
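  One reading of the "constrained to d2∘d1 = 0" hill-climb is to project each gradient step off d1 before renormalizing; a sketch (the projection step is my interpretation of the constraint; d1 is assumed unit length):

      #include <cmath>
      #include <vector>

      using Vec = std::vector<double>;

      double dot(const Vec& a, const Vec& b) {
          double s = 0;
          for (size_t j = 0; j < a.size(); j++) s += a[j] * b[j];
          return s;
      }

      // One constrained hill-climb step: take the gradient direction g = 2*A*d
      // (computed elsewhere), remove its component along d1 so the new d2 stays
      // perpendicular to d1, then renormalize so d2 o d2 = 1. Dropping the
      // projection line gives the unconstrained variant discussed above.
      Vec constrainedStep(Vec g, const Vec& d1) {
          double c = dot(g, d1);
          for (size_t j = 0; j < g.size(); j++) g[j] -= c * d1[j];  // now g o d1 = 0
          double norm = std::sqrt(dot(g, g));
          for (size_t j = 0; j < g.size(); j++) g[j] /= norm;
          return g;
      }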

  19. WINE: GV, MVM and GM runs.

      ACCURACY   WINE
      GV         62.7
      MVM        66.7
      GM         81.3

  [The rest of the slide is residue of the per-subcluster F-value/count/gap listings and the hill-climbed d-vectors for each split, e.g., d = (.11, .19, .96, .19) with V = 209 hill-climbed to (-.02, .41, .91, 0) with V = 232, together with the per-interval low/high class counts.]

  20. SEEDS: GV, MVM and GM runs.

      ACCURACY   SEEDS   WINE
      GV         94      62.7
      MVM        93.3    66.7
      GM         96      81.3

  [The rest of the slide is residue of the per-subcluster F-value/count/gap listings and gradient hill-climb traces, e.g., d = (.98, .14, .06, .13) with V(d) = 9; interval labels such as "[0,9) 0k 0r 18c" give the per-class counts per F-interval.]

  21. IRIS: GM, GV and MVM runs.

      ACCURACY   IRIS   SEEDS   WINE
      GV         82.7   94      62.7
      MVM        94     93.3    66.7
      GM         94.7   96      81.3

  [The rest of the slide is residue of the per-subcluster F-value/count/gap listings (classes s, e, i) and the hill-climb traces of d and V(d) at each split.]

  22. CONCRETE GM MVM GV

ACCURACY   CONCRETE   IRIS   SEEDS   WINE
GV            76      82.7    94     62.7
MVM          78.8     94      93.3   66.7
GM            83      94.7    96     81.3

[Slide residue: FAUST F-value histograms and per-interval class tallies (L/M/H = Low/Medium/High strength) for the CONCRETE data, with sub-cluster counts such as "43L 33M 55H C1"; the detailed tables did not survive extraction.]

  23. ABALONE GV GM MVM

ACCURACY   CONC   IRIS   SEEDS   WINE   ABAL
GV          76     83     94      63     73
MVM         79     94     93      67     79
GM          83     95     96      81     81

[Slide residue: starting-vector/variance traces and F-value histograms with per-interval class tallies (L/M/H) for the ABALONE data, e.g. "30L 85M 12H C1"; the detailed tables did not survive extraction.]

  24. KOS blogs (d = unit STD vector; gap threshold > 6 x average gap)
GV was run on the 22 highest-STD KOS words, with
d = (.46 .16 .03 .32 .71 .07 .06 .03 .09 .03 .10 .10 .19 .04 .16 .14 .01 .02 .04 .02 .00 .02),
and separately with d = e841 (the single highest-STD word). Gaps greater than 6 times the average gap were declared cuts. Many of the resulting singleton clusters are substantial outlier documents (e.g. docs 1832, 2852, 1197, 201, 1150, 1335); doc 2852 has the largest gap (28.18). Under one ordering the average gap was .0085; under the other it was 0.1, with 64 gaps, a maximum gap of 28.2, and a gap threshold of .6.
[Slide residue: sorted-projection tables (Doc, F, Gap, cut flag), cluster-size lists for d = USTD/MVM versus GV, and per-cluster gap/count annotations (C0-C16); the full listings did not survive extraction.]
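The 6-times-average-gap rule above is easy to state in ordinary (horizontal) code. A minimal numpy sketch, assuming the rule is exactly "sort the projections F, compute consecutive gaps, and cut wherever a gap exceeds 6 times the mean gap" (this loops over values; it is not the pTree implementation the slides use):

import numpy as np

def gap_clusters(F, mult=6.0):
    """Sort 1-D projections, cut at every gap > mult * average gap.
    Returns a list of clusters (arrays of original indices)."""
    order = np.argsort(F)
    Fs = F[order]
    gaps = np.diff(Fs)                      # consecutive gaps in sorted order
    cuts = np.where(gaps > mult * gaps.mean())[0]
    return np.split(order, cuts + 1)        # split after each oversized gap

# toy usage: one point far from a dense run becomes a singleton (outlier) cluster
F = np.r_[np.linspace(0.0, 1.0, 100), 100.0]
print([len(c) for c in gap_clusters(F)])    # -> [100, 1]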

  25. GV using a grid (Unitized Corners of the Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians)
[Slide residue: CONCRETE hill-climb traces -- for each starting unit vector UCUC(0001)..UCUC(1111), plus the akk and MVM starts, the sequence of d-vectors (d1 d2 d3 d4) and projection variances per iteration; every trace converges, up to sign, toward d = (0.40, 0.15, -0.77, 0.48) with VAR = 12967.]
On these pages we display the variance hill-climb, for each of the four datasets (Concrete, IRIS, Seeds, Wine), from a grid of starting unit vectors d. I took the circumscribing non-negative unit cube and used all of its unitized diagonals. In low dimension (every dataset here is 4-dimensional) this grid is very nearly a uniform grid. Note that this will work less and less well as the dimension grows, since the number of cube corners doubles with each added dimension. In all cases the same local maximum, and nearly the same unit vector, is reached.
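A sketch of this grid search, under two stated assumptions: the hill-climb step is taken to be d <- unitize(C d), i.e. power iteration on the covariance matrix C (whose Rayleigh quotient d'Cd, the projection variance, never decreases), and the akk start is taken to be the unitized diagonal of C (the title's "Diagonal of the Variance Matrix"):

import itertools
import numpy as np

def variance_hill_climb(X, d, iters=50, tol=1e-9):
    """Climb d toward a local max of VAR(X o d) via power iteration on C."""
    C = np.cov(X, rowvar=False)
    for _ in range(iters):
        d_new = C @ d
        d_new /= np.linalg.norm(d_new)
        if min(np.linalg.norm(d_new - d), np.linalg.norm(d_new + d)) < tol:
            d = d_new
            break
        d = d_new
    return d, float(d @ C @ d)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 1.0, 0.5, 0.2])   # toy 4-D data

# grid of starts: unitized corners (diagonals) of the non-negative unit 4-cube
starts = [np.array(c, float) / np.linalg.norm(c)
          for c in itertools.product([0, 1], repeat=4) if any(c)]
akk = np.diag(np.cov(X, rowvar=False))
starts.append(akk / np.linalg.norm(akk))    # 'akk' start: unitized diag of C

results = [variance_hill_climb(X, d0.copy()) for d0 in starts]
print(sorted({round(v, 1) for _, v in results}))   # typically one value: the global max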

  26. GV using a grid (Unitized Corners of the Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians), 2
[Slide residue: the same hill-climb traces for the SEEDS, WINE, and IRIS data -- starting vectors UCUC(1000)..UCUC(1111), akk, and MVM, with d-vectors and variances per iteration; each dataset's climbs converge to a common direction and maximum variance (VAR = 9 for SEEDS, 608 for WINE, 420 for IRIS).]
As we all know, Dr. Ubhaya is the best mathematician on campus, and he is attempting to prove three things:
1. That a GV hill-climb which does not reach the global maximum variance is rare indeed.
2. That the global maximum is guaranteed to be reached from at least one of the coordinate unit vectors (so a 90-degree grid will always suffice).
3. That akk will always reach the global maximum.

  27. Finding round clusters that aren't DPPd-separable (no linear gap)? Find the golf ball: suppose we have a white-mask pTree, but no linear gap exists to reveal the cluster. Search a grid of d-tubes until a DPPd gap is found in the interior of a tube (form the mask pTree for the interior of the d-tube, then apply DPPd to that mask to reveal interior gaps). Also look for conical gaps (fix the cone point at the middle of the tube) over all cone angles, i.e., look for an interval of angles containing no points. Notice that this method subsumes DPPd, since a gap at a cone angle of 90 degrees is a linear gap.
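A hypothetical sketch of the conical-gap check, assuming "cone angle" means the angle at the cone point p between the direction d and y - p, and a "conical gap" is an empty interval in the sorted angles (numpy, not pTrees):

import numpy as np

def cone_angle_gaps(X, p, d, min_gap_deg=10.0):
    """Angles (degrees) between unit d and y-p for all y; report empty
    angular intervals wider than min_gap_deg (candidate conical gaps)."""
    V = X - p
    lens = np.linalg.norm(V, axis=1)
    keep = lens > 0                              # drop p itself if present
    cos = (V[keep] @ d) / lens[keep]
    ang = np.sort(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    g = np.diff(ang)
    return [(ang[i], ang[i + 1]) for i in np.where(g > min_gap_deg)[0]]

# toy: a tight cluster along d plus a ring at 90 degrees leaves an angular gap
rng = np.random.default_rng(1)
p, d = np.zeros(3), np.array([0.0, 0.0, 1.0])
along = rng.normal(scale=0.05, size=(50, 3)) + d   # small angles
ring = rng.normal(size=(50, 3)); ring[:, 2] = 0.0  # exactly 90-degree angles
print(cone_angle_gaps(np.vstack([along, ring]), p, d))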

  28. FAUST Gap Revealer
We look for gaps of width >= 2^4 = 16, so we compute all pTree bit-slice combinations down to p4 and p4'. With d = M - p, the projections F = z o d for the 15 points z1..zf are:

  z     z1   z2   z3   z4   z5   z6   z7    z8    z9    za    zb    zc    zd    ze    zf
  (x,y) 1,1  3,1  2,2  3,3  6,2  9,3  15,1  14,2  15,3  13,4  10,9  11,10 9,11  11,11 7,8
  F     11   27   23   34   53   80   118   114   125   114   110   121   109   125   83

[Slide residue: the bit slices p6..p0 of F, their complements p6'..p0', and the AND combinations used to count points per interval.]

Walking the width-16 intervals given by the three high-order bits:
[0,16) = [000 0000, 000 1111] holds one point, z1 (F=11) -- a 2^4 thinning. z1 is only 5 units from the right edge, so z1 is not yet declared an outlier. We check the minimum of the next interval to see whether z1's right-side gap is actually >= 16 (computing that minimum is a pTree process -- no looping over x required!).
[16,32) = [001 0000, 001 1111]: the minimum, z3 (F=23), is 7 units from the left edge, so z1 has only a 5+7 = 12 unit gap on its right (not a 2^4 gap), and z1 is declared a 2^4 inlier.
[32,48): z4 (F=34) is within 2 of the left edge 32, so z4 is not declared an anomaly.
[48,64): z5 (F=53) is 19 from z4 (F=34), which exceeds 16, and 11 from the right edge 64; but the next interval [64,80) is empty, so z5 is in fact 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.
[64,80) is clearly a 2^4 gap.
[80,96): z6 (F=80) and zf (F=83) form a doubleton set gapped >= 16 on both sides, so both {z6, zf} are declared outliers.
[96,112): zb (F=110) and zd (F=109). [112,128): z7=118, z8=114, z9=125, za=114, zc=121, ze=125. No 2^4 gaps appear here, but we can consult SpS(d^2(x,y)) for actual distances, which confirms there are no 2^4 gaps in this subcluster. Incidentally it reveals a 5.8 gap between {z7,z8,z9,za} and {zb,zc,zd,ze}, but that analysis is messy and the gap would be revealed by the next round on this subcluster anyway.

Pairwise distances within {z7..ze} (the original table's decimal labels z10..z14 are za..ze):
z7-z8 1.4   z7-z9 2.0   z7-za 3.6   z7-zb 9.4   z7-zc 9.8   z7-zd 11.7  z7-ze 10.8
z8-z9 1.4   z8-za 2.2   z8-zb 8.1   z8-zc 8.5   z8-zd 10.3  z8-ze 9.5
z9-za 2.2   z9-zb 7.8   z9-zc 8.1   z9-zd 10.0  z9-ze 8.9
za-zb 5.8   za-zc 6.3   za-zd 8.1   za-ze 7.3
zb-zc 1.4   zb-zd 2.2   zb-ze 2.2   zc-zd 2.2   zc-ze 1.0   zd-ze 2.0
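A horizontal (looping) re-creation of the interval walk above, assuming the pTree step amounts to bucketing F by its high-order bits (interval width 2^4 = 16) and then checking per-interval counts, minima, and maxima before declaring outliers:

import numpy as np

def interval_report(F, width=16):
    """Bucket projections into width-16 intervals (the high-order bit slices)
    and report count, min and max per interval; empty intervals are 2^4 gaps."""
    report = {}
    for b in range(int(F.max()) // width + 1):
        in_b = F[(F >= b * width) & (F < (b + 1) * width)]
        report[(b * width, (b + 1) * width)] = (
            len(in_b),
            in_b.min() if len(in_b) else None,
            in_b.max() if len(in_b) else None)
    return report

# the F = z o d values from the slide (z1..zf)
F = np.array([11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83])
for iv, (ct, mn, mx) in interval_report(F).items():
    print(iv, "count:", ct, "min:", mn, "max:", mx)
# e.g. [64,80) is empty (a 2^4 gap); [48,64) holds only F=53, whose nearest
# neighbors are 19 below and 27 above, so that point is declared an outlier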

  29. FAUST Tube Clustering (this method attempts to build tubular-shaped gaps around clusters)
MATH: we need the dot-product projection length and the dot-product projection distance.
Squared projection length of y on f: (yof)^2 / (fof).
Squared projection distance of y on f: yoy - (yof)^2 / (fof).
Squared projection distance of y-p on the line from p to q: (y-p)o(y-p) - ((y-p)o(q-p))^2 / ((q-p)o(q-p)), where the first term expands to yoy - 2yop + pop.
Gaps in the dot-product projection lengths give cuts across the tube (tube-cap gaps); gaps in the projection distances give tube-radius gaps.
Exhaustive search for all tubular gaps takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width):
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (a direction, so (n-1)-dimensional -- a grid on the surface of the sphere in R^n).
Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)), two functionals are used to enclose subclusters in tubular gaps:
a. SquareTubeRadius functional, STR(y) = (y-p)o(y-p) - ((y-p)od)^2;
b. TubeLength functional, TL(y) = (y-p)od.
This allows a better fit around convex clusters that are elongated in one direction (not round).
Given p, do we need a full grid of directions d? No! d and -d give the same TL-gaps. Given d, do we need a full grid of starting points p? No! All p' such that p' = p + cd give the same gaps. So hill-climb the gap width from a good starting point and direction.
That is, we need to compute the constants and the two dot-product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? Minimizing PTreeSet functional creations and PTreeSet operations.
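The two functionals in code, with a straightforward (non-pTree) gap scan down the tube; STR and TL are exactly as defined above, while the tube radius r and the gap threshold are free parameters of this sketch:

import numpy as np

def tube_gaps(X, p, d, r, min_gap):
    """Mask the interior of the (p,d)-tube of radius r using STR(y), then
    scan TL(y) along the tube for gaps wider than min_gap (d unit length)."""
    V = X - p
    TL = V @ d                                   # TL(y) = (y-p) o d
    STR = np.einsum('ij,ij->i', V, V) - TL**2    # STR(y) = (y-p)o(y-p) - ((y-p)od)^2
    t = np.sort(TL[STR <= r * r])                # tube-interior mask, then sort
    g = np.diff(t)
    return [(t[i], t[i + 1]) for i in np.where(g > min_gap)[0]]

# toy: two clusters strung along d are separated by a tube-interior gap
rng = np.random.default_rng(2)
d = np.array([1.0, 0.0, 0.0])
A = rng.normal(scale=0.2, size=(40, 3))
B = rng.normal(scale=0.2, size=(40, 3)) + np.array([5.0, 0.0, 0.0])
print(tube_gaps(np.vstack([A, B]), p=np.zeros(3), d=d, r=1.0, min_gap=2.0))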

  30. Cone Clustering (finding cone-shaped clusters)
The functional is F = (y-M)o(x-M)/|x-M| - min, restricted to a cosine cone about the line from the mean M to a corner point x: keep y whenever cos(y-M, x-M) exceeds the cone threshold, then look for gaps in F. A cosine cone gap (over some angle) is a gap in the dot-product projections onto the corner-points line.
Results on IRIS for various corner points x and cone thresholds (s/e/i class tallies):
x=s1, cone=1/sqrt(2): 50 points. x=s2, cone=1/sqrt(2): 51; cone=.9: 47; cone=.1: 59.
x=e1, cone=.707: 60 points. x=i1, cone=.707: 75 points. w maxs, cone=.707: 137 points.
w maxs-to-mins, cone=.939: 114 points (14 i and 100 s/e). w aaan-aaax, cone=.54: 100/104 s or e, so picks i as the complement.
w naaa-xaaa, cone=.95: 41/43 e, so picks e. w maxs, cone=.93: 27/29 are i's; cone=.925: 31/34 are i's. w xnnn-nxxx, cone=.95: 43/50 e, so picks out e.
[Slide residue: the F-value histograms behind these tallies did not survive extraction.]
Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths). The length of the fixed vector x-M is a one-time calculation; the length y-M changes with y, so build a PTreeSet for it.
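A sketch of the cosine-cone restriction plus projection, assuming the construction is: fix corner point x and mean M, keep y with cos(y-M, x-M) at or above the cone threshold, then compute F(y) = (y-M)o(x-M)/|x-M| on the kept points (numpy stand-in for the PTreeSet computation):

import numpy as np

def cosine_cone_F(X, M, x, cone=0.707):
    """Restrict to the cosine cone about x-M, then return the projection
    lengths F(y) = (y-M)o(x-M)/|x-M| for the y inside the cone."""
    f = x - M
    flen = np.linalg.norm(f)                  # one-time calculation
    V = X - M
    vlen = np.linalg.norm(V, axis=1)          # changes with y (the PTreeSet)
    cos = (V @ f) / (flen * np.where(vlen == 0, 1, vlen))
    inside = cos >= cone
    return (V[inside] @ f) / flen, inside

# toy usage: gap-scan the returned F exactly as in the linear (DPP) case
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
M = X.mean(axis=0)
F, inside = cosine_cone_F(X, M, x=X[0], cone=0.9)
print(inside.sum(), "points in the cone; F range",
      F.min().round(2), "to", F.max().round(2))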

  31. "Gap Hill Climbing": mathematical analysis rotation d toward a higher F-STD or grow 1 gap using support pairs: 0 1 2 3 4 5 6 7 8 9 a b c d e f f 1 0 e2 3 d4 5 6 c7 8 b9 a 9 8 7 6 5 a j k l m n 4 b c q r s 3 d e f o p 2 g h 1 i 0 0 1 2 3 4 5 6 7 8 9 a b c f 1 e2 3 d4 5 6 c7 8 b9 a 9 8 7 6 5 a j k 4 b c q 3 d e f 2 1 0 =p d2-gap d2-gap p C123 p avg=14 q avg=17 0 1 2 3 3 2 4 4 5 7 6 4 7 8 8 2 9 11 10 4 12 3 13 1 20 1 21 1 22 2 23 1 27 2 28 1 29 1 30 2 31 4 d1-gap d1-gap 32 2 33 3 34 4 35 1 36 3 37 4 38 2 39 2 40 5 41 3 42 3 43 6 44 8 45 1 46 2 47 1 48 3 49 3 51 7 52 2 53 2 54 3 55 1 56 3 57 3 58 1 61 2 63 2 64 1 66 1 67 1 q= q d2 d1 d1 d2 F-slices are hyperplanes (assuming F=dotd) so it would makes sense to try to "re-orient" d so that the gap grows.Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces which is cut by the gap (or thinning), take as p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning.This is easy since our method produces the pTree mask the sequence of F-values and the sequence of counts of points that give us those value that we use to find large gaps in the first place. Dot F p=aaan q=aaax 0 6 1 28 2 7 3 7 4 1 5 1 9 7 10 3 11 5 12 13 13 8 14 12 15 4 16 2 17 12 18 5 19 6 20 6 21 3 22 8 23 3 24 3 C1<7 (50 Set) d2-gap >> than d1=gap (still not optimal.) Weight mean by the dist from gap? (d-barrel radius) 7<C2<16 (4i, 48e) In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius2) (zero weighting after the first gap is identical to the previous). Also we really want to identify the Support vector pair of the gap (the pair, one from one side and the other from the other side which are closest together) as p and q (in this case, 9 and a but we were just lucky to draw our vector through them.) We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q??? C3>16 (46i, 2e) hill-climb gap at 16 w half-space avgs. C2uC3 p=avg<16 q=avg>16 0 1 1 1 2 2 3 1 7 2 9 2 10 2 11 3 12 3 13 2 14 5 15 1 16 3 17 3 18 2 19 2 20 4 21 5 22 2 23 5 24 9 25 1 26 1 27 3 28 2 29 1 30 3 31 5 32 2 33 3 34 3 35 1 36 2 37 4 38 1 39 1 42 2 44 1 45 2 47 2 No conclusive gaps Sparse Lo end: Check [0,9] 0 1 2 2 3 7 7 9 9 i39 e49 e8 e44 e11 e32 e30 e15 e31 i39 0 17 21 21 24 22 19 19 23 e49 17 0 4 4 7 8 8 9 9 e8 21 4 0 1 5 7 8 10 8 e44 21 4 1 0 4 6 8 9 7 e11 24 7 5 4 0 7 9 11 7 e32 22 8 7 6 7 0 3 6 1 e30 19 8 8 8 9 3 0 4 4 e15 19 9 10 9 11 6 4 0 6 e31 23 9 8 7 7 1 4 6 0 i39,e49,e11 singleton outliers. {e8,i44} doubleton outlier set There is a thinning at 22 and it is the same one but it is not more prominent. Next we attempt to hill-climb the gap at 16 using the mean of the half-space boundary. (i.e., p is avg=14; q is avg=17. Sparse Hi end: Check [38,47] distances 38 39 42 42 44 45 45 47 47 i31 i8 i36 i10 i6 i23 i32 i18 i19 i31 0 3 5 10 6 7 12 12 10 i8 3 0 7 10 5 6 11 11 9 i36 5 7 0 8 5 7 9 10 9 i10 10 10 8 0 10 12 9 9 14 i6 6 5 5 10 0 3 9 8 5 i23 7 6 7 12 3 0 11 10 4 i32 12 11 9 9 9 11 0 4 13 i18 12 11 10 9 8 10 4 0 12 i19 10 9 9 14 5 4 13 12 0 i10,i18,i19,i32,i36 singleton outliers {i6,i23} doubleton outlier Here, gap between C1,C2 is more pronounced Why? Thinning C2,C3 more obscure? It did not grow gap wanted to grow (tween C2 ,C3.

  32. Calls for Papers
CAINE 2013: 26th International Conference on Computer Applications in Industry and Engineering, September 25-27, 2013, Omni Hotel, Los Angeles, California, USA. Sponsored by the International Society for Computers and Their Applications (ISCA). CAINE-2013 will feature contributed papers as well as workshops and special sessions; papers will be accepted into oral presentation sessions. Topics include, but are not limited to: Agent-Based Systems; Image/Signal Processing; Autonomous Systems; Information Assurance; Big Data Analytics; Information Systems/Databases; Bioinformatics, Biomedical Systems/Engineering; Internet and Web-Based Systems; Computer-Aided Design/Manufacturing; Knowledge-Based Systems; Computer Architecture/VLSI; Mobile Computing; Computer Graphics and Animation; Multimedia Applications; Computer Modeling/Simulation; Neural Networks; Computer Security; Pattern Recognition/Computer Vision; Computers in Education; Rough Sets and Fuzzy Logic; Computers in Healthcare; Robotics; Computer Networks; Fuzzy Logic Control Systems; Sensor Networks; Data Communication; Scientific Computing; Data Mining; Software Engineering/CASE; Distributed Systems; Visualization; Embedded Systems; Wireless Networks and Communication. Important dates: workshop/special session proposals, May 25, 2013; full paper submission, June 5, 2013; notification of acceptance, July 5, 2013; pre-registration and camera-ready paper due, August 5, 2013; event dates, September 25-27, 2013.
SEDE 2013 is interested in gathering researchers and professionals in the domains of Software Engineering and Data Engineering to present and discuss high-quality research results and outcomes in their fields, and aims at facilitating cross-fertilization of ideas in Software and Data Engineering. Topics include, but are not limited to: Requirements Engineering for Data-Intensive Software Systems; Software Verification and Model Checking; Model-Based Methodologies; Software Quality and Software Metrics; Architecture and Design of Data-Intensive Software Systems; Software Testing; Service- and Aspect-Oriented Techniques; Adaptive Software Systems; Information System Development; Software and Data Visualization; Development Tools for Data-Intensive Software Systems; Software Processes; Software Project Management; Applications and Case Studies; Engineering Distributed, Parallel, and Peer-to-Peer Databases; Cloud Infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management; Semi-Structured Data and XML Databases; Data Integration, Interoperability, and Metadata; Data Mining: Traditional, Large-Scale, and Parallel; Ubiquitous Data Management and Mobile Databases; Data Privacy and Security; Scientific and Biological Databases and Bioinformatics; Social Networks, Web, and Personal Information Management; Data Grids, Data Warehousing, OLAP; Temporal, Spatial, Sensor, and Multimedia Databases; Taxonomy and Categorization; Pattern Recognition, Clustering, and Classification; Knowledge Management and Ontologies; Query Processing and Optimization; Database Applications and Experiences; Web Data Management and Deep Web. Important dates: paper submission deadline, May 23, 2013; notification of acceptance, June 30, 2013; registration and camera-ready manuscript, July 20, 2013. Conference website: http://theory.utdallas.edu/SEDE2013/
ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems. Important dates: special session proposals, May 5, 2013; full paper submission, June 5, 2013; author notification, July 5, 2013; advance registration and camera-ready paper due, August 5, 2013.
CBR-MD 2013: International Workshop on Case-Based Reasoning, July 19, 2013, New York, USA. Topics of interest include (but are not limited to): CBR for signals, images, video, audio, and text; similarity assessment; case representation and case mining; retrieval and indexing; conversational CBR; meta-learning for model improvement and parameter setting for processing with CBR; incremental model improvement by CBR; case-base maintenance for systems; case authoring; life-time of a CBR system; measuring coverage of case bases; ontology learning with CBR. Submission deadline: March 20, 2013; notification: April 30, 2013; camera-ready: May 12, 2013.
DMLS: Workshop on Data Mining in Life Sciences. Topics: discovery of high-level structures, including e.g. association networks; text mining from biomedical literature; medical image mining; biomedical signal mining; temporal and sequential data mining; mining heterogeneous data; mining data from molecular biology, genomics, proteomics, phylogenetic classification. With regard to methodologies and case studies: data mining project development methodology for biomedicine; integration of data mining in the clinic; ontology-driven data mining in life sciences; methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples; data mining for personal disease management; utility considerations in DMLS, including e.g. cost-sensitive learning. Submission deadline: March 20, 2013; notification: April 30, 2013; camera-ready: May 12, 2013; workshop date: July 19, 2013.
DMM'2013: Workshop on Data Mining in Marketing. In the business environment, data warehousing -- the practice of creating huge, central stores of customer data that can be used throughout the enterprise -- is becoming more and more common practice, and as a consequence the importance of data mining is growing stronger. Topics: applications in marketing; methods for user profiling; mining insurance data; e-marketing with data mining; logfile analysis; churn management; association rules for marketing applications; online targeting and controlling; behavioral targeting; juridical conditions of e-marketing, online targeting, and so on; control of online-marketing activities; new trends in online marketing; aspects of e-mailing activities and newsletter mailing. Submission deadline: March 20, 2013; notification: April 30, 2013; camera-ready: May 12, 2013; workshop date: July 19, 2013.
DMA 2013: Workshop on Data Mining in Agriculture. Topics: data mining on sensor and spatial data from agricultural applications; analysis of remote sensor data; feature selection on agricultural data; evaluation of data mining experiments; spatial autocorrelation in agricultural data. Submission deadline: March 20, 2013; notification: April 30, 2013; camera-ready: May 12, 2013; workshop date: July 19, 2013.

  33. Hierarchical Clustering
Any maximal anti-chain (a maximal set of nodes no two of which are directly connected) in a dendrogram is a clustering, so a dendrogram offers many clusterings. Horizontal anti-chains are the clusterings produced by top-down (or bottom-up) methods.
[Slide residue: a dendrogram over leaves A..G -- merges B,C; D,E; F,G; then A with BC and DE with FG; then ABC with DEFG.]
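A toy illustration: a dendrogram as nested tuples with merge heights; cutting at a height h yields a horizontal maximal anti-chain, i.e. one clustering per cut. The tree below mirrors the slide's A..G figure:

# dendrogram node: ('leaf', label) or ('merge', height, left, right)
def leaves(node):
    if node[0] == 'leaf':
        return [node[1]]
    return leaves(node[2]) + leaves(node[3])

def cut(node, h):
    """Return the clustering (list of leaf-label lists) given by cutting the
    dendrogram at height h -- a maximal anti-chain of nodes."""
    if node[0] == 'leaf':
        return [[node[1]]]
    _, height, left, right = node
    if height <= h:                        # whole subtree is one cluster
        return [leaves(node)]
    return cut(left, h) + cut(right, h)    # the cut passes below this merge

D = ('merge', 3.0,
     ('merge', 2.0, ('leaf', 'A'),
      ('merge', 1.0, ('leaf', 'B'), ('leaf', 'C'))),
     ('merge', 2.5,
      ('merge', 1.0, ('leaf', 'D'), ('leaf', 'E')),
      ('merge', 1.0, ('leaf', 'F'), ('leaf', 'G'))))
print(cut(D, 1.5))   # [['A'], ['B','C'], ['D','E'], ['F','G']]
print(cut(D, 2.2))   # [['A','B','C'], ['D','E'], ['F','G']]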

  34. GV, F = (DPP-MN)/4, Concrete (C, W, FA, A)
The first round (F = (DPP-MN)/4) produces, by F-gaps: CLUS_1 = [90,113), 0L 6M 0H (gap=14); CLUS_2 = [74,90), 0L 4M 0H (gap=6); CLUS_3 = [52,74), 0L 7M 0H (gap=7); and the bulk cluster CLUS_4 with 43L 46M 55H. At this level, FinalClus1 = {17M}, 0 errors.
Re-scanning CLUS_4 with F = (DPP-MN)/2 and F-gaps >= 2 gives the subclusters:
F=0: 0L 0M 3H, CLUS 4.4.1 (gap=7)
F=7: 0L 0M 4H, CLUS 4.4.2 (gap=2)
F in [8,14]: 1L 5M 22H, CLUS 4.4.3 (1L+5M errors if called H) (gap=3)
F=15: 0L 0M 4H, CLUS 4.3.1 (gap=3)
F=18: 0L 0M 10H, CLUS 4.3.2 (gap=3)
F in [20,24): 0L 10M 2H, CLUS 4.7.2 (2H errors) (gap=2)
F in [24,30): 10L 0M 0H, CLUS 4.7.1 (gap=2)
F in [30,33]: 0L 4M 0H, CLUS 4.2.1 (gap=2)
F=34: 0L 2M 0H, CLUS 4.2.2 (gap=6)
F=40: 0L 4M 0H, CLUS 4.2.3 (gap=7)
F=47: 0L 3M 0H, CLUS 4.2.4 (gap=5)
F in [50,59): 12L 1M 4H, CLUS 4.8.1 (1M+4H errors if called L) (gap=2)
F in [59,63): 8L 0M 0H, CLUS 4.8.2 (gap=2)
F=64: 2L 0M 2H, CLUS 4.6.1 (2H errors if called L) (gap=3)
F in [66,70): 10L 0M 0H, CLUS 4.6.2 (gap=3)
F in [70,79): 10L 0M 0H, CLUS 4.5 (gap=7)
F=79: 5L 0M 0H, CLUS 4.1.1 (gap=6)
F in [85,90): 2L 0M 1H, CLUS 4.1 (1 error)
Accuracy = 90%.
Suppose we know (or want) 3 clusters: Low, Medium, and High strength. We can use an anti-chain that gives us exactly 3 subclusters in two ways, one shown in brown and the other in purple on the slide. Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low 11, Medium 0, High 26, so 96/133 = 72% accurate. Purple error count: Low 2, Medium 22, High 35, so 74/133 = 56% accurate.
What about agglomerating using single-link agglomeration (minimum pairwise distance)? Agglomerate (build the dendrogram) by iteratively gluing together the clusters with minimum median separation. Should the rounds have been normalized -- i.e., should the same F-divisor have been used so that the range of values in the 2nd round matched that of the 1st round (on CLUS 4)? Can we normalize after the fact by multiplying 1st-round values by 100/88 = 1.76? Or should we agglomerate the 1st-round clusters and the 2nd-round clusters independently?

  35. Agglomerating using single link (minimum pairwise distance = minimum gap size: glue the min-gap adjacent clusters first), GV, CONCRETE
[The CLUS 4 subcluster table (F = (DPP-MN)/2, F-gaps >= 2) is repeated from slide 34.]
The first thing to notice is that outliers mess up agglomerations which are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers first by backing away from all gap >= 5 agglomerations, and then look for a 3-subcluster maximal anti-chain. What we have done is to declare F < 7 and F > 84 as extreme tripleton outlier sets, and F=79, F=40, and F=47 as singleton outlier sets, because each is F-gapped by at least 5 (which is actually 10 in the unhalved DPP units) on either side.
The brown anti-chain again gives more uniform sizes. Brown errors: Low 8, Medium 12, High 6, so 107/133 = 80% accurate. The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) causes lots of error. C4.7.1 and C4.7.2 are problematic: they do separate out, but in increasing F-order the class pattern is H M L M L, so if we suspected this pattern we would look for 5 subclusters instead of 3. The 5 orange errors in increasing F-order are 6, 2, 0, 0, and 8, giving 127/133 = 95% accurate.
If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of HMLML is just bizarre! So we should expect errors.
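On 1-D F-values, single-link agglomeration reduces to repeatedly gluing the pair of adjacent clusters with the minimum gap, since on a line the closest pair of clusters is always an adjacent pair. A minimal sketch (stopping at k clusters; the slide's variant glues by minimum median separation instead, which only changes the gap formula):

import numpy as np

def single_link_1d(F, k):
    """Agglomerate sorted 1-D values into k clusters by repeatedly merging
    the two adjacent clusters separated by the smallest gap (single link)."""
    clusters = [[v] for v in np.sort(F)]
    while len(clusters) > k:
        # gap between adjacent clusters = right cluster's min - left cluster's max
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = int(np.argmin(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# toy usage on a handful of F-values
F = np.array([0, 7, 10, 11, 12, 15, 18, 22, 26, 31, 34, 40, 47,
              55, 61, 67, 71, 79, 87])
for c in single_link_1d(F, 3):
    print(c[0], "...", c[-1], "size", len(c))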
