Week 6 Progress Update: kNN Analysis and Clustering Techniques
In Week 6, I worked with Enrique on correcting the kNN and threshold graphs; the best k-value was found to be 475, although the results are still noisy. I also explored clustering via the PICS method, a parameter-free approach for mining attributed graphs. PICS can reveal useful insight into large datasets such as Twitter and YouTube, and its output is illustrated in the generated figures. The other scripts I worked with have not yielded good results yet; I will continue to work on the code and methods in the coming week.
Presentation Transcript
Week 6 Shelby Thompson
This week…
• Emailed Enrique; my kNN/threshold graphs were wrong
• Redid them and experimented with many k-values; results are still too noisy
• k-values ranged from 5 to 500
• The greater the k-value, the closer the threshold graph was to the p-distance graph
Best Threshold Graph (non-neighbors set to 0; threshold graph on the right)
• Best k-value was found to be 475
Best Threshold Graph (non-neighbors set to Inf; threshold graph on the right)
• Best k-value was found to be 475
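One way to make the choice of k depend less on eyeballing the images is to score every k against the ground-truth match matrix. The sketch below does this on a small synthetic distance matrix. The data, the variable names (`trueMatch`, `recallAtK`), and the scoring rule (the fraction of true matches that fall inside the k nearest neighbors) are illustrative assumptions, not the procedure used for the graphs on these slides.

```matlab
% Toy data (assumed): 6 test samples, 8 training samples, identities 1..4.
testIds  = [1 2 3 4 1 2];
trainIds = [1 1 2 2 3 3 4 4];
nTest  = numel(testIds);
nTrain = numel(trainIds);

% Ground truth: trueMatch(i,j) = 1 when test i and train j share an identity.
trueMatch = zeros(nTest, nTrain);
for i = 1:nTest
    trueMatch(i, :) = (trainIds == testIds(i));
end

% Synthetic distances: true matches are closer; the mod term is a
% deterministic stand-in for noise so the example is reproducible.
[jj, ii] = meshgrid(1:nTrain, 1:nTest);
dist = 1 - 0.6 * trueMatch + 0.05 * mod(ii + 2 * jj, 5);

kValues = [1 2 4 8];
recallAtK = zeros(size(kValues));
for n = 1:numel(kValues)
    k = kValues(n);
    knnGraph = zeros(nTest, nTrain);
    for i = 1:nTest
        [~, ind] = sort(dist(i, :), 'ascend');
        knnGraph(i, ind(1:k)) = 1;
    end
    % Fraction of true matches that are kept as neighbors.
    recallAtK(n) = sum(knnGraph(:) & trueMatch(:)) / sum(trueMatch(:));
end
disp([kValues; recallAtK]);
```

Recall can only go up as k grows, so a score like this is most useful next to a measure of how dense the kNN graph has become.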
Clustering
• Next looked at clustering
• Used a paper Mahdi suggested: “PICS: Parameter-free Identification of Cohesive Subgroups in Large Attributed Graphs” by Leman Akoglu, Hanghang Tong, Brendan Meeder, and Christos Faloutsos
• The paper proposes the PICS method of clustering
PICS
• Method for mining attributed graphs
• Requires no user input/parameters
• Running time scales linearly with total graph and attribute size
• PICS can reveal useful insight into datasets such as Twitter and YouTube
• The above datasets have tens of thousands of nodes
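To make "attributed graph" concrete: as I understand the paper, the input is a binary node-to-node adjacency matrix together with a binary node-to-attribute matrix, and the goal is to group nodes so that both matrices fall into homogeneous blocks. The sketch below only builds a toy pair of matrices and reorders them by a hand-written grouping to show that block structure. It does not implement PICS, and `A`, `F`, and `groups` are made-up illustration data.

```matlab
% Toy attributed graph: 6 nodes, 4 binary attributes (made-up data).
A = [0 0 1 0 1 0;
     0 0 0 1 0 1;
     1 0 0 0 1 0;
     0 1 0 0 0 1;
     1 0 1 0 0 0;
     0 1 0 1 0 0];      % symmetric adjacency; nodes {1,3,5} and {2,4,6} are linked
F = [1 1 0 0;
     0 0 1 1;
     1 1 0 0;
     0 0 1 1;
     1 0 0 0;
     0 0 1 1];          % rows = nodes, columns = attributes

% Hypothetical node grouping, in the form a clustering method might return.
groups = [1 2 1 2 1 2];

% Reorder the nodes so that members of the same group sit next to each other.
[~, order] = sort(groups);
Asorted = A(order, order);
Fsorted = F(order, :);

disp(Asorted);   % edges now fall into two diagonal blocks
disp(Fsorted);   % attributes now fall into two row blocks
```

Passing `Asorted` and `Fsorted` to `imagesc` gives the same kind of block picture as the PICS figures, just at toy scale.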
Images generated from PICS (Figure 1)
• Figure 1 shows all of the nodes, separated, before any operation is performed on them
Images generated from PICS (Figure 2)
• Figure 2 shows the node groups in Figure 1, divided based on the average location of the group and number of nodes in the group, before the operations are performed
Images generated from PICS (Figure 3)
• Figure 3 shows the node groups in Figure 1, divided based on the average location of the group and number of nodes in the group, after the operations are performed
Images generated from PICS (Figure 4)
• Figure 4 shows the major node groups after the clustering operations are performed
Other work this week…
• Worked with a number of scripts
• None have yielded good results yet
• Will continue to work on them this coming week
kNN Code:
Part 1:
    load('pf83_gabor_lbp_hog_2048.mat')
    % Cosine distance between every test image and every training image
    dist = pdist2(fbgTestImgs', fbgTrainImgs', 'cosine');
    figure; imagesc(dist);
    [rows, cols] = size(dist);   % rows = test images, cols = training images
    % Ground truth: zeroMatrix(i,j) = 1 when test image i and training image j share an ID
    zeroMatrix = zeros(rows, cols);
    for i = 1:numel(fbgTestIds)
        for j = 1:numel(fbgTrainIds)
            if fbgTestIds(i) == fbgTrainIds(j)
                zeroMatrix(i,j) = 1;
            end
        end
    end
Part 2:
    % kNN graph: mark the knn nearest training images of each test image
    knn = 100;
    knnIndZero = zeros(rows, cols);
    for i = 1:rows
        [~, ind] = sort(dist(i,:), 'ascend');
        knnIndZero(i, ind(1:knn)) = 1;
    end
    % Threshold graph: keep every training image at least as close as dist(i,i)
    % (using dist(i,i) as the threshold assumes test image i and training image i are paired)
    threshIndZero = zeros(rows, cols);
    for i = 1:rows
        ind = dist(i,:) <= dist(i,i);
        threshIndZero(i, ind) = 1;
    end
    figure; imagesc(zeroMatrix)
    figure; imagesc(knnIndZero)
    figure; imagesc(threshIndZero)
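The nested loop that fills `zeroMatrix` can also be written as a single broadcast comparison. The sketch below checks on made-up ID vectors that the two forms agree; `bsxfun` is used so that it also runs on MATLAB releases without implicit expansion. The names `loopMatch` and `vecMatch` are only for this example.

```matlab
% Toy ID vectors (made-up values).
testIds  = [3 1 2];
trainIds = [1 2 2 3 1];

% Loop version, same pattern as the kNN code.
loopMatch = zeros(numel(testIds), numel(trainIds));
for i = 1:numel(testIds)
    for j = 1:numel(trainIds)
        if testIds(i) == trainIds(j)
            loopMatch(i, j) = 1;
        end
    end
end

% Vectorized version: compare a column of test IDs against a row of train IDs.
vecMatch = double(bsxfun(@eq, testIds(:), trainIds(:)'));

disp(isequal(loopMatch, vecMatch));
```

With 83 identities the speed difference is small, but the one-line form removes the risk of sizing the matrix from the wrong variable.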
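Clustering Code: (Runs fine but no good output)
Part 1:
    load('data/A_call.mat')
    load('data/F_call.mat')
    load('pf83_gabor_lbp_hog_2048.mat')
    xlabels = {'prof','grad','grad-1','ugrad','ugrad-1','staff','sloan'};
    % Ground-truth match matrix (same construction as zeroMatrix in the kNN code)
    groundTruthLabel = zeros(numel(fbgTestIds), numel(fbgTrainIds));
    for i = 1:numel(fbgTestIds)
        for j = 1:numel(fbgTrainIds)
            if fbgTestIds(i) == fbgTrainIds(j)
                groundTruthLabel(i,j) = 1;
            end
        end
    end
Part 2:
    clust = test_reality('call', 1, inf);
    % clust is treated as a vector of cluster labels, one per node,
    % so the number of clusters is max(clust), not length(clust)
    numClust = max(clust);
    cHist = zeros(numClust, 83);
    for c = 1:numClust
        ind = clust == c;
        for i = 1:83
            % NOTE: groundTruthLabel only holds 0/1, so this comparison can match
            % only for i = 1; a per-node label vector is probably needed here
            cHist(c,i) = sum(groundTruthLabel(ind) == i);
        end
    end
    figure; imagesc(cHist)
    figure; imagesc(groundTruthLabel)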
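The cluster-versus-class histogram that the Part 2 loop is aiming for can be built in one call once every node has both a cluster label and a class label. A minimal sketch on made-up labels follows. Here `clust` and `labels` are toy vectors, not output of `test_reality`, and the purity score at the end is just one possible summary that I am assuming would be useful.

```matlab
clust  = [1 1 2 2 2 3 3 1];   % cluster label per node (toy data)
labels = [1 1 2 2 1 3 3 1];   % ground-truth class per node (toy data)

numClust = max(clust);
numClass = max(labels);

% cHist(c, i) = number of nodes in cluster c whose true class is i.
cHist = accumarray([clust(:), labels(:)], 1, [numClust, numClass]);
disp(cHist);

% Purity: fraction of nodes that belong to the majority class of their cluster.
purity = sum(max(cHist, [], 2)) / numel(clust);
disp(purity);
```

A histogram with one dominant entry per row means the clusters line up with the classes; rows that are spread out are the "no good output" case.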
K-Means: (Still in progress)
    X = [fbgTrainImgs'; fbgTestImgs'];   % one row per image, training images first
    k = 100;
    opts = statset('MaxIter', 10);
    [idx, ctrs] = kmeans(X, k, 'Replicates', 1, 'Options', opts);
    classes = unique(fbgTrainIds);
    nTrain = numel(fbgTrainIds);
    % idx holds the cluster assignment of each row of X; ctrs holds the centroids,
    % so the train/test split has to be taken from idx
    trnCtrs = idx(1:nTrain);
    trainHist = zeros(k, length(classes));
    for c = 1 : k
        ind = trnCtrs == c;
        for i = 1 : length(classes)
            trainHist(c,i) = sum(ind & fbgTrainIds(:) == classes(i));
        end
    end
    tstCtrs = idx(nTrain+1:end);
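Once the assignments are split into training and test parts, one possible next step (an assumption about where this is heading, not something shown on the slides) is to label each cluster with its majority training identity and check how often test points land in a cluster carrying their own identity. The sketch uses hand-written cluster assignments instead of calling `kmeans`, so it runs without the Statistics Toolbox. All values, and the names `trainClust`, `testClust`, and `clusterLabel`, are toy choices.

```matlab
trainIds   = [1 1 2 2 3 3];
testIds    = [1 2 3 3];
trainClust = [1 1 2 2 3 2];   % cluster index of each training point (toy)
testClust  = [1 2 3 2];       % cluster index of each test point (toy)
k = 3;
classes = unique(trainIds);

% trainHist(c, i) = number of training points in cluster c with identity classes(i).
trainHist = zeros(k, numel(classes));
for c = 1:k
    for i = 1:numel(classes)
        trainHist(c, i) = sum(trainClust == c & trainIds == classes(i));
    end
end

% Label each cluster with its most frequent training identity.
[~, best] = max(trainHist, [], 2);
clusterLabel = classes(best);

% Predict each test point from its cluster and score the result.
predicted = clusterLabel(testClust);
accuracy = mean(predicted(:) == testIds(:));
disp(trainHist);
disp(accuracy);
```

With k = 100 clusters and 83 identities, many clusters may end up with no clear majority, which would show up as a low score here.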