Mining Association Rules from Microarray Gene Expression Data
Association Rules • Form: LHS => RHS, where LHS and RHS are disjoint itemsets. • Frequent itemset: support_T(I) ≥ α, where support_T(I) is the number of transactions in T that contain all the items in I and α is the minimum support.
Objective Measurements of Rule Interestingness • Support of the rule LHS => RHS: Support(LHS ∪ RHS), the frequency with which LHS and RHS occur together in a transaction. • Confidence of the rule LHS => RHS: Support(LHS ∪ RHS) / Support(LHS), the frequency with which RHS occurs when LHS occurs.
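The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names and the toy transactions are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(transactions: List[Transaction], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions: List[Transaction], itemset: Iterable[str]) -> float:
    """Support as a fraction of all transactions."""
    return support_count(transactions, itemset) / len(transactions)


def confidence(transactions: List[Transaction],
               lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Support(LHS ∪ RHS) / Support(LHS) for the rule LHS => RHS."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    if lhs & rhs:
        raise ValueError("LHS and RHS must be disjoint")
    lhs_count = support_count(transactions, lhs)
    if lhs_count == 0:
        return 0.0  # rule never applies; report zero confidence
    return support_count(transactions, lhs | rhs) / lhs_count


# Hypothetical toy database (not the database from the slides).
toy = [
    frozenset({"A", "B", "D"}),
    frozenset({"A", "B", "C", "E"}),
    frozenset({"A", "B", "C", "D", "E"}),
    frozenset({"C", "E"}),
    frozenset({"D", "E"}),
]

print(support(toy, {"A", "B"}))            # support of the rule A => B
print(confidence(toy, {"A"}, {"B"}))       # confidence of A => B
print(confidence(toy, {"A", "B"}, {"D"}))  # confidence of A, B => D
```

Counting supersets directly is quadratic in the worst case, which is fine for a toy database but not for large transaction sets.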
Association Rule Discovery (ARD) algorithms • Originally used in market basket analysis • Unsupervised and domain independent • Example: Apriori algorithm (Agrawal, 1993) • Advantage: Finds all associations • Disadvantage: Very large number of associations
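As a rough illustration of how a level-wise search in the style of Apriori finds all frequent itemsets, here is a compact sketch. It is a simplified teaching version under stated assumptions (the function name `apriori`, an absolute minimum support count, and the toy data are all hypothetical), not an optimized or reference implementation.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: scan the transactions once per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database.
toy = [{"A", "B", "D"}, {"A", "B", "C", "E"}, {"A", "B", "C", "D", "E"},
       {"C", "E"}, {"D", "E"}]
frequent_itemsets = apriori(toy, min_support_count=2)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that every subset of a frequent itemset must also be frequent, so a candidate with an infrequent subset can be discarded without counting it.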
Microarray technology • Measurement of gene expression levels in cells • Simultaneous measurement of thousands of genes • Facilitates the study of gene interactions • Expression profile=set of values for different conditions • One condition = one slide • condition: tissue, treatment or time point • Data matrices
Rules Applied to Gene Expression • Each gene expression experiment is a single transaction and each gene is an item. • Gene value may be numerical, may need to be binned as being up (expressed), down (repressed), or neither. • Items in a gene expression transaction can also include relevant facts describing the cellular environment. • Find frequent itemsets: apply Apriori algorithm. • Generate association rules from the frequent itemsets.
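The binning step can be sketched as follows. The thresholds (±1.0 on an assumed log-ratio scale), the function name `to_transaction`, and the sample values are illustrative assumptions; the slides do not specify how values were binned.

```python
def to_transaction(profile, facts=(), up=1.0, down=-1.0):
    """Turn one experiment's expression profile into a set of items.

    profile: {gene name: expression value} (assumed to be log ratios).
    facts:   extra items describing the cellular environment.
    Genes that are neither up nor down contribute no item.
    """
    items = set(facts)
    for gene, value in profile.items():
        if value >= up:
            items.add(f"{gene}↑")   # expressed (up)
        elif value <= down:
            items.add(f"{gene}↓")   # repressed (down)
    return frozenset(items)


# Hypothetical experiments: one chip = one transaction.
experiments = {
    "chip1": ({"G1": 2.3, "G2": -1.7, "G3": 0.2}, ["heat_shock"]),
    "chip2": ({"G1": 1.4, "G2": -0.3, "G3": -2.1}, []),
}
transactions = {name: to_transaction(profile, facts)
                for name, (profile, facts) in experiments.items()}
for name, items in transactions.items():
    print(name, sorted(items))
```

The resulting item sets can then be passed to any frequent-itemset miner such as Apriori.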
Example - association rules • Database T1: A B C D E T2 : A B C D E T3 : A B C D E T4 : A B C D E T5 : A B C D E • Rules R1: A B C E (sup=0.40, conf=0.67) R2: A B D (sup=0.40, conf=1.00) R3: A B (sup=0.60, conf=1.00)
Rule induction algorithms • G1 is Gene 1, S1 is sample 1, ↑ means overexpressed, and ↓ means underexpressed. • Input table: one row per sample (S1, S2, ..., Sn), one column per gene (G1, G2, G3, G4) holding its expression level, and a Class column (Cancer or Non-Cancer). • An association rule discovery algorithm is applied to the table to produce a rule set. • Rule set R1 : G1 G3 Cancer R2 : G1 G4 Non-Cancer R3: G1 G3 G2 Cancer
Forming Association rules • Any frequent itemset of size greater than one can be divided into two itemsets, LHS and RHS. • Using objective measures: If the confidence of a candidate rule exceeds a specified minimum confidence criterion, the rule may be included. • A very large set of rules is usually generated.
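A minimal sketch of this step: every non-empty proper subset of a frequent itemset is tried as the LHS, and the rule is kept if its confidence meets the threshold. The function name and the support counts below are hypothetical. A frequent itemset of size k yields 2^k − 2 candidate rules, which is one reason the rule set grows so quickly.

```python
from itertools import combinations


def rules_from_itemset(itemset, support_counts, min_confidence):
    """Split a frequent itemset into LHS => RHS rules above min_confidence.

    support_counts maps frozensets to support counts and must contain
    the itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs_items in combinations(sorted(itemset), size):
            lhs = frozenset(lhs_items)
            rhs = itemset - lhs
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


# Hypothetical support counts for the subsets of {A, B, D}.
support_counts = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 3,
    frozenset("AB"): 3, frozenset("AD"): 2, frozenset("BD"): 2,
    frozenset("ABD"): 2,
}
rules = rules_from_itemset("ABD", support_counts, min_confidence=0.8)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```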
Example • 28 treatments were recorded on 28 microarray chips, giving 28 transactions. • Each chip contains expression levels of approximately 6200 genes (items). • After applying the Apriori algorithm, 70,000,000 rules were generated.
Subjective Measures for Selecting Rules • Select genes before finding frequent items – limit the number of items. • Limit the size of LHS or RHS E.g., LHS contains only one item. • Domain specific measures -- Use knowledge about the domain under study to specify patterns.
Rule Filtering and Grouping (Tuzhilin and Adomavicius 2002) • Rule templates – specify restrictions on the combinations of genes and their expression levels that can appear in the body and head of the rule. • Template form: RulePart HAS Quantifier OF C1, C2, ..., CN [ONLY] • RulePart: BODY, HEAD, or RULE. • C1, C2, ..., CN is a comparison set, representing a list of genes against which the discovered rules will be compared: • A gene, e.g., G17 • A gene with a particular expression level, e.g., G17↑ • A group (category) of genes, e.g., [DNA_Repair] • A group of genes with an expression level, e.g., [DNA_Repair]↑ • A group of genes with a list of allowable or unallowable expression levels, e.g., [DNA_Repair] = {↑, #}
Rule Filtering (cont'd) • Quantifier: a keyword or an expression specifying how many of the genes in the C1, C2, ..., CN list have to be contained in RulePart. • ALL, ANY, NONE, specifying the number of genes from C1, C2, …, CN the RulePart must have • A numeric value; e.g., 2, specifying that a rule must have exactly 2 genes from the comparison set • A list of numeric values and ranges; e.g., 1,3,5-7 • ONLY is used to indicate that RulePart can have only the genes in the C1, C2, …, CN list.
Rule Filtering (cont'd) • All rules that contain at least one of the following genes: G1, G5, G7: • RULE HAS (ANY) OF G1, G5, G7 • Matching rule: G1↑ => G3↑. • "When genes involved in the DNA repair are upregulated, what other gene categories are also up- or downregulated?" • BODY HAS (ANY) OF [DNA_Repair]↑ AND HEAD HAS (ANY) OF [All_Genes]={↑, ↓}
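A rule template of the form RulePart HAS Quantifier OF C1, ..., CN [ONLY] can be approximated with a small predicate. This is a simplified sketch, not the authors' implementation: gene groups and numeric ranges are left out, items are assumed to be strings such as "G1↑", and all function names are hypothetical.

```python
LEVELS = "↑↓#"  # assumed level symbols: up, down, no change


def matches(item, pattern):
    """'G1' matches G1 at any level; 'G1↑' matches only that exact item."""
    return item == pattern or item.rstrip(LEVELS) == pattern


def part_has(part, quantifier, comparison, only=False):
    """Check one rule part (a set of items) against a comparison set."""
    matched = sum(1 for p in comparison if any(matches(i, p) for i in part))
    if quantifier == "ALL":
        ok = matched == len(comparison)
    elif quantifier == "ANY":
        ok = matched >= 1
    elif quantifier == "NONE":
        ok = matched == 0
    else:
        ok = matched == int(quantifier)  # exact numeric count
    if only:
        # ONLY: the part may not contain anything outside the comparison set.
        ok = ok and all(any(matches(i, p) for p in comparison) for i in part)
    return ok


def rule_has(rule, quantifier, comparison, only=False):
    """RULE part = body and head taken together."""
    body, head = rule
    return part_has(set(body) | set(head), quantifier, comparison, only)


# Hypothetical rules as (body, head) pairs.
r1 = ({"G1↑"}, {"G3↑"})
r2 = ({"G2↑"}, {"G3↓"})
print(rule_has(r1, "ANY", ["G1", "G5", "G7"]))  # RULE HAS (ANY) OF G1, G5, G7
print(rule_has(r2, "ANY", ["G1", "G5", "G7"]))
```

Stripping the level suffix before comparing keeps "G1" from matching "G17↑", which a plain prefix test would get wrong.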
Macro Templates (Tuzhilin and Adomavicius 2002) • Detect unexpected rules: CONTRADICT(GeneExprSet, G, ExpLevel) = BODY HAS (ALL) OF GeneExprSet AND HEAD HAS (ALL) OF G ≠ {ExpLevel} • CONTRADICT({G1, G2},G4,} • Unexpected rule: G1G2G3 G4
Rule Grouping • Group similar rules together into classes to be analyzed • Gene hierarchy: group genes based on their functions. • [Figure: gene hierarchy tree with root ALL, functional groups F1 and F2, and leaf genes G1 to G5]
Rule Grouping (cont'd) • Aggregated rules • Groups: F1={G1,G2,G3}, F2={G4,G5} • Rules: R1={G1G4}, R2={G1G5}, R3={G1G3G5} • Aggregated rule: R=F1F2, R'=F1F2 • {R1,R2,R3}R, {R1,R2}R', R3R'
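One way to picture rule aggregation is to replace each gene by its functional group and collect the rules that collapse to the same group-level rule. The sketch below uses the groups F1 and F2 from the slide; expression levels are omitted, and the body/head split assigned to R1, R2 and R3 is an assumption made for illustration only.

```python
from collections import defaultdict

# Functional groups as given on the slide.
GROUPS = {"F1": {"G1", "G2", "G3"}, "F2": {"G4", "G5"}}
GENE_TO_GROUP = {gene: name for name, genes in GROUPS.items() for gene in genes}


def aggregate(rule):
    """Map a gene-level rule (body, head) to a group-level rule."""
    body, head = rule
    return (frozenset(GENE_TO_GROUP[g] for g in body),
            frozenset(GENE_TO_GROUP[g] for g in head))


# Assumed body => head split; expression levels left out.
rules = {
    "R1": ({"G1"}, {"G4"}),
    "R2": ({"G1"}, {"G5"}),
    "R3": ({"G1", "G3"}, {"G5"}),
}

classes = defaultdict(list)
for name, rule in rules.items():
    classes[aggregate(rule)].append(name)

for (body, head), members in classes.items():
    print(sorted(body), "=>", sorted(head), members)
```

With expression levels included in the aggregation key, the same rules could fall into different classes, which is presumably what distinguishes the slide's R from R'.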
Rule Derivation Procedure • Database of transactions • Feature selection: some features may be irrelevant, or a reduction may be required for efficiency reasons. • Reduced database • Rule induction: a rule induction algorithm is applied to the database (e.g., an association rule algorithm or a decision tree algorithm). • Initial rule set • Rule selection: rules are assessed and ranked using different measures of interestingness. • Relevant rule set • Rule validation: the rules need to be validated to be accepted as knowledge; in the context of gene expression data, this can be done by more detailed biological experiments. • Rules representing knowledge
Association Rules in Medical Data • Medical record data • Millions of “claims” for medical procedures • Each patient may have several claims • Each claim may have several line items or records, one for each procedure performed. • A diagnosis was reported with each procedure. • Data: patient code, procedure code, diagnosis code • Goal: discover relationships between procedure performed on a patient and the reported diagnosis.
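Association Rules in Medical Data (cont'd) • Data items: the set of all procedure and diagnosis codes (7,365 + 9,383 = 16,748) • Transaction: the set of procedure and diagnosis codes for each patient (1,257,645 patients) • Because all of a patient's claims are pooled into one transaction, a procedure can be paired with an unrelated diagnosis, which may cause unexpected rules such as {cast} => {heart disease}. • Each itemset consists of one or more procedure/diagnosis codes. • The support of an itemset I is the number of patients whose set of items includes all the items in I; an itemset is frequent if its support exceeds 1% of patients.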
Formulating Rules • Apply the Apriori algorithm to generate all frequent itemsets. • Restrict attention to those frequent itemsets that contain both procedures and diagnoses. • For each selected frequent itemset, formulate one rule with all procedure codes on the left and all diagnosis codes on the right. • Compute the confidence of each rule and eliminate rules with confidence less than 65%.
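The rule formulation steps can be sketched as below. The record layout (patient, procedure, diagnosis), the tagging of codes as "proc" or "diag", and the toy records are assumptions for illustration; the frequent-itemset mining step itself is not repeated here.

```python
from collections import defaultdict


def patient_transactions(records):
    """Pool every claim line of a patient into one transaction."""
    pooled = defaultdict(set)
    for patient, procedure, diagnosis in records:
        pooled[patient].add(("proc", procedure))
        pooled[patient].add(("diag", diagnosis))
    return [frozenset(items) for items in pooled.values()]


def formulate_rule(itemset, transactions, min_confidence=0.65):
    """Procedures on the left, diagnoses on the right; None if rejected."""
    lhs = frozenset(i for i in itemset if i[0] == "proc")
    rhs = frozenset(i for i in itemset if i[0] == "diag")
    if not lhs or not rhs:
        return None  # itemset must mix procedures and diagnoses
    lhs_count = sum(1 for t in transactions if lhs <= t)
    if lhs_count == 0:
        return None
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    conf = both_count / lhs_count
    return (lhs, rhs, conf) if conf >= min_confidence else None


# Hypothetical claim lines: (patient code, procedure code, diagnosis code).
records = [
    ("p1", "cast", "fracture"),
    ("p2", "cast", "fracture"),
    ("p3", "cast", "fracture"),
    ("p3", "ecg", "heart disease"),
    ("p4", "ecg", "heart disease"),
]
txs = patient_transactions(records)
kept = formulate_rule({("proc", "cast"), ("diag", "fracture")}, txs)
dropped = formulate_rule({("proc", "cast"), ("diag", "heart disease")}, txs)
print(kept)
print(dropped)
```

In this toy data, patient p3 links a cast to heart disease only because the two claims share a patient; the confidence threshold removes that rule here, but with enough such patients it could survive.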