
Mining Biological Data


Presentation Transcript


  1. Mining Biological Data Jiong Yang, Ph.D. Visiting Assistant Professor UIUC jioyang@cs.uiuc.edu

  2. Data is Everywhere

  3. Data Mining is a Powerful Tool • Computational Biology • E-Commerce • Intrusion Detection • Multimedia Processing • Unstructured Data • . . . [Diagram: Data → Data Mining → Knowledge]

  4. Biological Data • Bioinformatics has become one of the most important applications of data mining. • DNA sequences • Protein sequences • Protein folding • Microarray data • ……

  5. Outline • Approximate sequential pattern mining • Coherent cluster: clustering by pattern similarity in a large data set

  6. Frequent Patterns • Model • A set of sequences of symbols. • a1,a2,a4 • a2,a3,a5 • a1,a4,a5,a6,a7 • If a pattern occurs more than a certain number of times, then this pattern is considered important. • a1,a4 • Widely studied • Frequent itemset mining: Agrawal and Srikant (IBM Almaden) • FP-growth: Han (UIUC) • Stream data: Motwani (Stanford) • …

  7. Apriori Property • Widely used in the data mining field • It holds for the support metric • All patterns form a lattice. • (a, b, d) is a super-pattern of (a, d) and it is a sub-pattern of (a, b, c, d). • The support metric defines a partial order on the lattice. • Support(a, b, d) <= min{Support(b, d), Support(a, d), Support(a, b)} • A level-wise search algorithm can be used, as sketched below.
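
To make the level-wise search concrete, here is a minimal Apriori sketch in Python over the sequences of slide 6, treated as itemsets; the function and variable names are illustrative, not from the talk:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: count candidate k-patterns, keep the frequent
    ones, and build (k+1)-candidates only from frequent k-patterns."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        # One scan over the data counts the support of every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Apriori property: a (k+1)-pattern can be frequent only if all
        # of its k-subpatterns are frequent, so prune everything else.
        unions = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        candidates = {c for c in unions
                      if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent

# The sequences from slide 6 as itemsets; (a1, a4) comes out frequent.
data = [{"a1", "a2", "a4"}, {"a2", "a3", "a5"}, {"a1", "a4", "a5", "a6", "a7"}]
print(apriori(data, min_support=2))
```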

  8. Shortcomings • Requires exact match and fails to recognize possible substitutions among symbols • A protein may mutate without changing its functionality. • A sensor may make some mistakes. • Different web pages may have similar contents. • A word may have many synonyms. • How can symbol substitution be modeled?

  9. Compatibility Matrix • Compatibility matrix of 5 symbols (row = true symbol, column = observed symbol):

         observed
true     d1     d2     d3     d4     d5
d1       0.9    0.1    0      0      0
d2       0.05   0.8    0.05   0.1    0
d3       0.05   0      0.7    0.15   0.1
d4       0      0.1    0.1    0.75   0.05
d5       0      0      0.15   0      0.85

  10. Compatibility Matrix • The compatibility matrix serves as a bridge between the observation and the underlying substance. • Each observed symbol is interpreted as an occurrence of a set of symbols with various probabilities. • An observed symbol combination is treated as an occurrence of a set of patterns with various degrees. • Obtain the compatibility matrix through • empirical study • domain expert
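
As a small illustration (the data structure is mine, not the paper's), the matrix from slide 9 can be stored with rows indexed by true symbol and columns by observed symbol; interpreting an observation then means expanding it into the true symbols it may stand for:

```python
symbols = ["d1", "d2", "d3", "d4", "d5"]

# Compatibility matrix from slide 9 (row = true symbol, column = observed).
C = {
    "d1": {"d1": 0.90, "d2": 0.10, "d3": 0.00, "d4": 0.00, "d5": 0.00},
    "d2": {"d1": 0.05, "d2": 0.80, "d3": 0.05, "d4": 0.10, "d5": 0.00},
    "d3": {"d1": 0.05, "d2": 0.00, "d3": 0.70, "d4": 0.15, "d5": 0.10},
    "d4": {"d1": 0.00, "d2": 0.10, "d3": 0.10, "d4": 0.75, "d5": 0.05},
    "d5": {"d1": 0.00, "d2": 0.00, "d3": 0.15, "d4": 0.00, "d5": 0.85},
}

def interpret(observed):
    """Expand an observed symbol into the set of true symbols it may
    represent, with their probabilities (slide 10)."""
    return {true: C[true][observed] for true in symbols if C[true][observed] > 0}

print(interpret("d3"))  # {'d2': 0.05, 'd3': 0.7, 'd4': 0.1, 'd5': 0.15}
```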

  11. Match • A new metric, match, is proposed to quantify the importance of a pattern. • The match of a pattern P in a subsequence s of the same length is defined as the conditional probability Prob(P | s). • The match of a pattern P in a sequence S is defined as the maximal match of P in every distinct subsequence of S. • A dynamic programming technique is used to compute the match of P in a sequence S.

  12. Match • M(d1d2…di, S1S2…Sj) is the maximum of M(d1d2…di, S1S2…Sj-1) and M(d1d2…di-1, S1S2…Sj-1) × C(di, Sj). • The match of a pattern P in a set of sequences is defined as the sum of the match of P in each sequence. • A pattern is called a frequent pattern if its match exceeds a user-specified threshold min_match. • Example for pattern d1d2 against sequence S = d1 d3 d4:

p        d1      d3      d4
d1       0.9     0.9     0.9
d1d2     0.045   0.09    0.09
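
The recurrence translates directly into a short dynamic program. A sketch, reusing the compatibility matrix C from the previous snippet; it reproduces the example table above:

```python
def match(pattern, sequence, C):
    """Maximal match of `pattern` in `sequence` via the slide-12 recurrence:
    M(i, j) = max(M(i, j-1), M(i-1, j-1) * C[pattern[i-1]][sequence[j-1]])."""
    n, m = len(pattern), len(sequence)
    # M[i][j]: best match of the first i pattern symbols within the first
    # j observed symbols; the empty pattern matches with probability 1.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        M[0][j] = 1.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            M[i][j] = max(M[i][j - 1],
                          M[i - 1][j - 1] * C[pattern[i - 1]][sequence[j - 1]])
    return M[n][m]

# Pattern d1d2 against S = d1 d3 d4, as in the table above.
print(match(["d1", "d2"], ["d1", "d3", "d4"], C))  # ~0.09
```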

  13. Challenges • Previous work focuses on short patterns. • Long patterns require a large number of scans through the input sequence. • Expensive I/O cost • Performance vs. Accuracy • Probabilistic Approach

  14. Chernoff Bound • Let X be a random variable whose range is R. Suppose that we have n independent observations of X and the observed mean is μ. The Chernoff bound states that, with probability 1 − δ, the true mean of X is at least μ − ε, where ε = sqrt(R² ln(1/δ) / (2n)). • Similarly, with probability 1 − δ, the true mean of X is at most μ + ε.

  15. Approach • Three-stage approach to mine patterns of length l: • Find the match of individual symbols and take a sample set of sequences • Pattern discovery on samples • Ambiguous pattern determination • Pattern discovery on samples: • Sample size: depends on the available memory • Based on the samples, three types of patterns are determined.

  16. Approach • Frequent pattern if match is greater than (min_match + ε) • Ambiguous pattern if match is between (min_match − ε) and (min_match + ε) • Infrequent pattern otherwise
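
A sketch of this classification step, assuming the usual one-sided Chernoff/Hoeffding form ε = sqrt(R² ln(1/δ) / (2n)); the exact constant used in the paper may differ:

```python
import math

def chernoff_epsilon(R, n, delta):
    """Error margin: with probability 1 - delta, the true mean lies within
    epsilon of the mean observed over n independent samples."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def classify(sample_match, min_match, eps):
    """Three-way classification of a pattern by its match on the sample."""
    if sample_match > min_match + eps:
        return "frequent"
    if sample_match >= min_match - eps:
        return "ambiguous"
    return "infrequent"

eps = chernoff_epsilon(R=1.0, n=10_000, delta=0.01)  # ~0.015
for m in (0.40, 0.30, 0.10):
    print(m, classify(m, min_match=0.30, eps=eps))
```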

  17. Ambiguous Patterns • Ambiguous patterns can be too many to verify individually. • Border collapse • We have the negative and positive borders of the significant patterns. • Our goal is to collapse the border as fast as possible, as sketched below.
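
The paper's border-collapse procedure is more elaborate, but its core idea can be sketched as a binary search along a chain of ambiguous patterns between the two borders: by the Apriori property, one verification in the middle settles half of the chain, so O(log k) data scans replace k:

```python
def collapse_chain(chain, is_frequent):
    """Binary search over a chain p[0] ⊂ p[1] ⊂ ... ⊂ p[-1] of ambiguous
    patterns. By the Apriori property, if p[mid] is frequent so is every
    pattern below it; if infrequent, so is every pattern above it.
    Returns the index of the longest frequent pattern, or -1 if none.
    `is_frequent` stands for the expensive check against the full data."""
    lo, hi, longest = 0, len(chain) - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_frequent(chain[mid]):
            longest, lo = mid, mid + 1
        else:
            hi = mid - 1
    return longest

chain = [("d1",), ("d1", "d2"), ("d1", "d2", "d4"), ("d1", "d2", "d4", "d5")]
print(collapse_chain(chain, lambda p: len(p) <= 2))  # -> 1, i.e. (d1, d2)
```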

  18. Ambiguous Patterns • The lattice of patterns between (d1) and (d1,d2,d3,d4,d5):
(d1,d2,d3,d4,d5)
(d1,d2,d3,d4) (d1,d2,d3,d5) (d1,d2,d4,d5) (d1,d3,d4,d5)
(d1,d2,d3) (d1,d2,d4) (d1,d2,d5) (d1,d3,d4) (d1,d3,d5) (d1,d4,d5)
(d1,d2) (d1,d3) (d1,d4) (d1,d5)
(d1)

  19. Ambiguous Patterns • [Figure: the same lattice, with the patterns near the top, e.g. (d1,d2,d3,d4,d5), marked infrequent and those near the bottom, e.g. (d1), marked frequent]

  20. Effects of 1−δ • [Figures: results without border collapse and with border collapse]

  21. Approximate Pattern Mining • Reference: • Mining long sequential patterns in a noisy environment, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 406-417, 2002. • Other Work • Periodic Patterns (KDD 2000, ICDM 2001) • Statistically Significant Patterns (KDD 2001, ICDM 2002)

  22. Outline • Approximate sequential pattern mining • Coherent cluster: clustering by pattern similarity in a large data set

  23. Coherent Cluster • In many applications, data can be of very high dimensionality. • Gene expression data: dozens to hundreds of conditions/samples • Customer evaluation: thousands or more merchants • Objective: discover peer groups • [Figure: data matrix with objects o1 … oi as rows, attributes a1 … aj as columns, and entry dij]

  24. [Figure: expression data of 40 genes under 17 conditions]

  25. Coherent Cluster

  26. [Figure: expression profiles of the 40 genes]

  27. Coherent Cluster • [Figure: co-regulated genes]

  28. Coherent Cluster • Observations: • If mapped to points in high dimensional space, they may not be close to each other. • Bias exists universally. • Only a subset of objects and a subset of attributes may participate. • Need to accommodate some degree of noise. • Solution: subspace cluster, bicluster, coherent cluster

  29. Subspace cluster • CLIQUE: Agrawal et al., IBM Almaden • Find a subset of dimensions and a subset of objects such that the distances between the objects on the subset of dimensions are small. • The clusters may overlap. • PROCLUS: Aggarwal et al., IBM T. J. Watson • Does not allow overlap.

  30. Bicluster • Developed in 2000 by Cheng and Church • Uses the mean squared residue • After discovering one cluster, replaces the cluster with random data and finds another • Neither efficient nor accurate

  31. Coherent Cluster • Coherent cluster • Subspace clustering • Measures distance on mutual bias • Pair-wise disparity • For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, the disparity is D = |(dxa − dya) − (dxb − dyb)|, where (dxa − dya) is the mutual bias of attribute a and (dxb − dyb) is the mutual bias of attribute b.

  32. Coherent Cluster • A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ. • An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster. • A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster. • Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
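
A brute-force sketch of the definition (illustrative only; enumerating every 2×2 submatrix is exactly what the real algorithm avoids):

```python
from itertools import combinations

def disparity(d, x, y, a, b):
    """Pair-wise disparity of the 2x2 submatrix on objects {x, y} and
    attributes {a, b}: the difference between the two mutual biases."""
    return abs((d[x][a] - d[y][a]) - (d[x][b] - d[y][b]))

def is_delta_coherent(d, objects, attributes, delta):
    """A submatrix is a delta-coherent cluster iff every one of its
    2x2 submatrices has disparity at most delta."""
    return all(disparity(d, x, y, a, b) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(attributes, 2))

# Row 1 is row 0 shifted by a constant bias of +5: perfectly coherent.
data = [[1.0, 3.0, 2.0, 4.0],
        [6.0, 8.0, 7.0, 9.0],
        [1.0, 9.0, 4.0, 0.0]]
print(is_delta_coherent(data, [0, 1], [0, 1, 2, 3], delta=0.5))  # True
print(is_delta_coherent(data, [0, 2], [0, 1, 2, 3], delta=0.5))  # False
```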

  33. Coherent Cluster • Challenges: • Finding subspace clusters based on distance itself is already a difficult task due to the curse of dimensionality. • The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix. • The actual values of the objects in a coherent cluster may be far apart from each other. • Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

  34. Coherent Cluster • Compute the maximum coherent attribute sets for each pair of objects • Compute the maximum coherent object sets for each pair of attributes • Two-way pruning • Construct the lexicographical tree • Post-order traverse the tree to find maximum coherent clusters

  35. Coherent Cluster • Observation: Given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff, for every attribute ai, the mutual biases (do1ai − do2ai) differ from each other by no more than δ. • Example: the mutual biases of (o1, o2) on a1, a2, a3, a4, a5 are 3, 2, 3.5, 2, 2.5, all within the interval [2, 3.5]. If δ = 1.5, then {a1,a2,a3,a4,a5} is a coherent attribute set (CAS) of (o1,o2).

  36. Coherent Cluster • Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ. • Example: with the mutual biases above and δ = 1, the maximum coherent attribute sets of (o1, o2) are {a1, a2, a4, a5} and {a1, a3, a5}. • The maximum coherent attribute sets define the search space for maximum coherent clusters, as sketched below.
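
Since coherence of an object pair depends only on the spread of its mutual biases, the maximum coherent attribute sets can be computed by sorting the biases and sliding a window of span at most δ. A sketch under that observation (function names are mine); it reproduces the example above:

```python
def max_coherent_attribute_sets(bias, delta):
    """Maximum coherent attribute sets (MCAS) for one object pair.
    `bias[a]` is the mutual bias of attribute a, d(o1, a) - d(o2, a)."""
    order = sorted(bias, key=bias.get)  # attributes by ascending bias
    windows, start = [], 0
    for end in range(len(order)):
        # Shrink the window from the left until its bias span is <= delta.
        while bias[order[end]] - bias[order[start]] > delta:
            start += 1
        windows.append(set(order[start:end + 1]))
    # Keep only maximal windows, i.e. those not contained in another one.
    return [w for w in windows if not any(w < v for v in windows)]

# Mutual biases of (o1, o2) from slide 35, with delta = 1 as on slide 36.
bias = {"a1": 3.0, "a2": 2.0, "a3": 3.5, "a4": 2.0, "a5": 2.5}
print(max_coherent_attribute_sets(bias, delta=1.0))
# [{'a1', 'a2', 'a4', 'a5'}, {'a1', 'a3', 'a5'}]
```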

  37. Two Way Pruning • δ = 1, nc = 3, nr = 3 • MCAS: • (o0,o2) → (a0,a1,a2) • (o1,o2) → (a0,a1,a2) • MCOS: • (a0,a1) → (o0,o1,o2) • (a0,a2) → (o1,o2,o3) • (a1,a2) → (o1,o2,o4) • (a1,a2) → (o0,o2,o4)

  38. Coherent Cluster • High expressive power • The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods. • Efficient and highly scalable • Wide applications • Gene expression analysis • Collaborative filtering • [Figure: traditional clustering vs. coherent clustering]

  39. Coherent Cluster • References: • δ-clusters: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002. • Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 394-405, 2002. • Enhanced biclustering on expression data, Proceedings of the IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2003. • Other Work • STING (VLDB 1997) • STING+ (ICDE 1999, TKDE 2000) • CLUSEQ (CSB 2002, ICDE 2003) • Cluster Streams (ICDE 2003)

  40. Remarks • Similarity measure • Powerful in capturing high order statistics and dependencies • Efficient in computation • Robust to noise • Clustering algorithm • High accuracy • High adaptability • High scalability • High reliability
