
Math 3346 Data Mining Presentation





Presentation Transcript


  1. Math 3346 Data Mining Presentation Yanting Ji 4465439 2009-10-29

  2. Today I will talk about the paper: An Association Analysis Approach to Biclustering By Gaurav Pandey, Gowtham Atluri, Michael Steinbach, Chad L. Myers and Vipin Kumar Department of Computer Science & Engineering, University of Minnesota, Minneapolis, USA

  3. Tour of the talk • Background • Range Support Patterns • Experimental Results • Summary • Comments

  4. Background Data sets: represented by a matrix M, whose entry Mij is the value of item j in transaction i (in this paper, a microarray data set is used). Bicluster: a group of items that shows coherent values across a subset of transactions or examples. Note: in ordinary clustering, we always use the full matrix M rather than a sub-matrix of M.

  5. Background Importance of biclustering: • discover transcription modules from microarray data, i.e. groups of genes that show coherent activity only across a subset of all the conditions constituting the data set. • reveal important information about the regulatory mechanisms operating in a cell.

  6. Background Types of biclusters 1. Constant value biclusters

  7. Background 2. Constant row biclusters

  8. Background 3. Coherent value biclusters (additive model)

  9. Background 4. Coherent evolution biclusters

  10. Background Recently used algorithms: • ISA • SAMBA • CC • xMotifs • CTCWC • OPSM • LCD • Co-clustering techniques Now we turn to the Range Support Patterns (RAP) approach.

  11. Overview of the CC algorithm • History: in 2000, Cheng and Church proposed a greedy heuristic algorithm to find biclusters in a microarray data set, i.e. sets of genes having coherent expression values across a set of conditions. • The goal of this algorithm is to attain the lowest mean squared error (MSE) score, given by the following formula:
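The MSE score referred to here is Cheng and Church's mean squared residue: for a bicluster with gene set I and condition set J it is

```latex
H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\ j \in J} \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^{2}
```

where a_iJ, a_Ij and a_IJ are the row mean, column mean and overall mean of the bicluster, respectively. A perfectly constant or additive (coherent value) bicluster has H(I, J) = 0.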

  12. Overview of ISA algorithm ISA is short for Iterative Signature Algorithm. • This algorithm aims to find biclusters consisting of genes that show significant expression individually, and also a high degree of co-expression with each other over a group of conditions. Three steps: • In the first step, a group of genes G0 is chosen randomly, and a gene score of 1 is assigned to each of them. The condition scores for all the conditions are computed over these genes, and the conditions whose absolute score is greater than a user-specified threshold tc are selected as C0. • In the second step, the gene scores for all genes are computed over these selected conditions, and the genes with gene scores greater than a user-specified threshold tg are selected as G1. • The above two steps are repeated until the algorithm converges to a group of genes Gn.
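The iteration above can be sketched in Python. This is a toy illustration on random data, not the authors' implementation; the score normalization here (plain z-scores) is a simplification of the real ISA scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((200, 20))   # toy expression matrix: 200 genes x 20 conditions
t_g, t_c = 1.5, 1.5                  # gene / condition score thresholds

# Step 1: start from a random gene set G0 (each gene implicitly scored 1).
genes = rng.choice(E.shape[0], size=30, replace=False)

for _ in range(50):
    # Condition scores over the current genes; keep |score| > t_c.
    cond_scores = E[genes].mean(axis=0)
    cond_scores = (cond_scores - cond_scores.mean()) / cond_scores.std()
    conds = np.flatnonzero(np.abs(cond_scores) > t_c)
    if conds.size == 0:
        break
    # Step 2: gene scores over the selected conditions; keep score > t_g.
    gene_scores = (E[:, conds] * np.sign(cond_scores[conds])).mean(axis=1)
    gene_scores = (gene_scores - gene_scores.mean()) / gene_scores.std()
    new_genes = np.flatnonzero(gene_scores > t_g)
    if new_genes.size == 0 or set(new_genes) == set(genes):
        break                        # converged to Gn (or the set died out)
    genes = new_genes
```

Because the starting set G0 is random, different seeds can converge to different biclusters, which is why ISA is run many times with many seeds in practice.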

  13. Range Support Patterns The Range Support Patterns consists of two parts: • The range support measure. • The algorithm.

  14. Range Support Patterns Firstly, a support measure for real-valued data is defined. Formally, given a data set D consisting of a set of transactions T, which contains a value V(t, a) for each item a in each transaction t, and a range threshold δ, the RangeSupport of a real-valued itemset I = {i1, i2, . . . , ik} is defined as RangeSupport(I) = Σ t∈T RS(t, I), where RS(t, I) is defined as RS(t, I) = min i∈I |V(t, i)| if all the values V(t, i), i ∈ I, have the same sign and max i∈I V(t, i) − min i∈I V(t, i) ≤ δ, and RS(t, I) = 0 otherwise.
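A minimal sketch of this measure in Python (the data layout, a list of item → value dicts, is an assumption made for illustration):

```python
def range_support(data, itemset, delta):
    """Sum over transactions of the per-transaction range support RS(t, I).

    data: list of dicts mapping item -> real value (one dict per transaction)
    itemset: the items I; delta: the range threshold.
    """
    total = 0.0
    for t in data:
        values = [t[i] for i in itemset]
        same_sign = all(v > 0 for v in values) or all(v < 0 for v in values)
        if same_sign and max(values) - min(values) <= delta:
            total += min(abs(v) for v in values)   # RS(t, I)
    return total

transactions = [
    {"a": 2.0, "b": 2.3, "c": -1.0},
    {"a": 1.5, "b": 1.4, "c": 5.0},
    {"a": -3.0, "b": 2.0, "c": 0.5},
]
print(range_support(transactions, ["a", "b"], delta=0.5))  # RS sums: 2.0 + 1.4
```

Transaction 3 contributes nothing to {a, b} because the signs of a and b differ there, which is exactly the sign-sensitivity the measure is designed to enforce.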

  15. Range Support Patterns

  16. Range Support Patterns • What’s more, we note that the RangeSupport measure is anti-monotonic. (proof omitted) • Now we know that the RangeSupport measure emphasizes two characteristics of real-valued data: • Range • Sign of the values of the items in a pattern.

  17. Range Support Patterns Secondly, we turn to the algorithm for finding range support patterns from real-valued data. • Since the RangeSupport measure defined for real-valued data tries to ensure the coherence and sign of the values of a group of items in a pattern while maintaining the anti-monotonicity property, an Apriori-like algorithm can easily be employed for finding range support patterns in a data set.
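A rough sketch of such an Apriori-like level-wise search (the toy data and the tuple-based candidate generation are illustrative assumptions, not the authors' code):

```python
from itertools import combinations

def rs(t, itemset, delta):
    """Per-transaction range support RS(t, I) for a dict transaction t."""
    vals = [t[i] for i in itemset]
    same_sign = all(v > 0 for v in vals) or all(v < 0 for v in vals)
    if same_sign and max(vals) - min(vals) <= delta:
        return min(abs(v) for v in vals)
    return 0.0

def mine_rap(data, delta, min_support):
    """Level-wise search for all itemsets with RangeSupport >= min_support."""
    items = sorted({i for t in data for i in t})
    level = [(i,) for i in items]
    frequent = []
    while level:
        # Keep itemsets whose total RangeSupport meets the threshold.
        survivors = [I for I in level
                     if sum(rs(t, I, delta) for t in data) >= min_support]
        frequent.extend(survivors)
        keep = set(survivors)
        # Extend by one item; anti-monotonicity lets us prune any candidate
        # with an infrequent subset, exactly as in classic Apriori.
        next_level = set()
        for I in survivors:
            for j in items:
                if j > I[-1]:
                    cand = I + (j,)
                    if all(s in keep for s in combinations(cand, len(cand) - 1)):
                        next_level.add(cand)
        level = sorted(next_level)
    return frequent

data = [
    {"a": 2.0, "b": 2.2, "c": 2.1},
    {"a": 1.0, "b": 1.1, "c": -4.0},
    {"a": 3.0, "b": 2.9, "c": 3.1},
]
print(mine_rap(data, delta=0.5, min_support=3.0))
```

The only change from binary-support Apriori is the support function; the candidate generation and subset-based pruning carry over unchanged because RangeSupport is anti-monotonic.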

  18. Experiment • It is time to evaluate the efficacy of the range support pattern mining technique for finding coherent gene groups from microarray data, comparing the results with those obtained from a similar analysis of the CC and ISA biclusters, which serve as control groups. (Why choose the CC and ISA biclusters?) • Two major methodologies: • Evaluation using an objective measure of coherence, the mean square error (MSE) of the values in a bicluster. • Evaluation of biclusters in terms of functional coherence, i.e. the functional enrichment of patterns.

  19. Experimental Results • The table below shows the details of the biclusters/patterns discovered by the RAP, CC and ISA algorithms under different parameter settings. The size range and coverage numbers are computed only for the finally selected non-overlapping patterns.

  20. Experimental Results Table :

  21. Experimental Results Observations from the table: • It can be seen that the biclusters produced by ISA and CC generally contain a larger number of genes than those found by RAP. • CC and ISA biclusters generally cover many more genes than RAP patterns. • The run time of RAP is comparable to that of the ISA runs, but much shorter than that of CC. (This result is actually not that important.) Note: only RAP3, RAP5, CC1, ISA2, ISA4 and ISA6 are chosen for the later evaluation. Why?

  22. Experimental Results First evaluation: (MSE) • Measure the coherence of each bicluster using the MSE score defined in the CC algorithm. • Analyze the distribution of these scores for all the sets of biclusters discovered by the different algorithms applied. The result is shown in the graph as follows:

  23. Experimental Results

  24. Experimental Results Some facts from the graph above: • The closer the distribution of scores for a set of biclusters is to zero, the better they are expected to capture the constant row model. • The scores for the range support patterns in RAP3 and RAP5 are almost all zero, with very few outliers. On the other hand, CC1 patterns show a much wider variability in these scores. • Hence we can conclude quantitatively that RAP is more likely to capture the constant row model than CC.

  25. Experimental Results The graphs below are sub-matrices of the data set corresponding to the biclusters from CC1, ISA2, RAP3.

  26. Experimental Results Observation and Conclusion: • Although RAP3 has the highest MSE, its corresponding graph shows more coherence in its rows than the graphs of the lower-MSE CC1 and ISA2. • These three graphs qualitatively demonstrate the better ability of RAP patterns to find accurate type 2 biclusters (constant row biclusters).

  27. Experimental Results Second evaluation: functional enrichment of patterns • Methodology: determine what fraction of the patterns have a p-value smaller than a specified threshold for at least one of the functional classes in the consideration set. • Criterion: the lower this p-value, the more functionally enriched the gene group is with respect to that class.
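Such enrichment p-values are typically computed with a hypergeometric tail test; a minimal sketch, assuming that test (the slide does not state which test the authors use):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric tail: probability that at least k of the n genes in a
    pattern fall in a functional class of size K, drawing from N genes total."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# A pattern of 10 genes, 6 of which belong to a class of 30 genes out of 4000:
# far more overlap than chance predicts, so the p-value is tiny.
p = enrichment_pvalue(N=4000, K=30, n=10, k=6)
print(p < 1e-6)
```

Note that the baseline class size K enters the formula directly, which is why the next slide controls for class size when comparing methods.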

  28. Experimental Results Since the size of a group can heavily impact the p-value, all classes are divided into two categories: small classes with 1–30 elements, and large classes with 31–500 elements. Results are as follows:

  29. Experimental Results

  30. Experimental Results • From this evaluation, we find that RAP is quite useful for discovering patterns that represent smaller functional classes, rather than the larger biclusters.

  31. Summary • An efficient framework, RAP, for directly mining association patterns from real-valued data sets is set up. • This algorithm is based on the novel anti-monotonic range-support measure. • Through comparison, RAP is quite useful for smaller functional classes and quite accurate at capturing the constant row model.

  32. Comments • Only the microarray data set is tested. How about other data sets? • We know that an Apriori algorithm is computationally expensive, but the time cost of the Apriori-like algorithm for RAP shown in the paper is quite good. It would be better to demonstrate the details of how the Apriori algorithm is modified in this case. • In the ISA case, the initial value for each of the three chosen ISA runs is 500, and it is also said that the greater the initial value, the better the result. I think some initial values larger than 500 could be used. • The comparison between RAP and CC in the first evaluation is good, but the comparison between RAP and ISA seems meaningless. • As we can see, in the RAP case, the sizes of the patterns are quite small. This implies that if some large patterns exist, this method may fail to find them.

  33. Questions and comments

  34. Apriori Algorithm (From lecture notes) • Basic principle: any subset of a frequent itemset must be frequent • Find the frequent itemsets: the sets of items that have minimum support • A subset of a frequent itemset must also be a frequent itemset. • If AB is a frequent itemset, both A and B must be frequent itemsets. • Iteratively find frequent itemsets with cardinality from 1 to k. • Use the frequent itemsets to generate association rules.

  35. Apriori Algorithm • Ck: candidate itemsets of size k • Lk: frequent itemsets of size k • L1 = {frequent items} • for (k = 2; Lk−1 ≠ ∅; k++): Ck = candidates generated from Lk−1 • for each transaction t in D: increment the count of the candidates in Ck contained in t • Lk = candidates in Ck with at least minimum support
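The pseudocode above corresponds to the classic Apriori loop; a compact runnable sketch on binary transactions (the grocery items are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic level-wise Apriori on binary transactions (sets of items).

    Returns a dict mapping each frequent itemset (frozenset) to its count.
    """
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]          # C1
    frequent = {}
    k = 1
    while level:
        # Count support of each candidate in one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        # Generate size-(k+1) candidates from Lk and prune any candidate
        # that has an infrequent k-subset (the basic Apriori principle).
        keys = sorted(Lk, key=sorted)
        level, seen = [], set()
        for a in keys:
            for b in keys:
                cand = a | b
                if len(cand) == k + 1 and cand not in seen:
                    seen.add(cand)
                    if all(frozenset(s) in Lk for s in combinations(cand, k)):
                        level.append(cand)
        k += 1
    return frequent

T = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
print(apriori(T, min_support=2))
```

With min_support = 2, {beer, milk} is pruned at the counting step, so the candidate {beer, bread, milk} is never even counted, which is where Apriori saves its work.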

  36. P-value • The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The fact that p-values are based on this assumption is crucial to their correct interpretation. • The lower the p-value, the less likely the result is under the null hypothesis.

  37. The End
