180 likes | 340 Vues
Affymetrix Expression Data. Comics Group 12-05-2003 Nijmegen Tim Hulsen. General Information (1). Affymetrix oligo microarrays: HG-U133 A and B (human) and MG-U74v2 A, B and C (mouse) Updated every two months; releases used here: november 2002 and january 2003 UniGene-based
E N D
Affymetrix Expression Data Comics Group 12-05-2003 Nijmegen Tim Hulsen
General Information (1) • Affymetrix oligo microarrays: HG-U133 A and B (human) and MG-U74v2 A, B and C (mouse) • Updated every two months; releases used here: november 2002 and january 2003 • UniGene-based • Probes: 25mer oligos complementary to the sequences of interest • Probe pairs: perfect match (PM) probe and mismatch (MM) probe, MM is different from PM in the 13th position
General Information (2) • Human chips: 3269 samples, 44792 fragments, 115 tissue categories (114 for nov. 2002 release), 15 SNOMED tissue categories • Mouse chips: 859 samples, 36701 fragments, 25 tissue categories, 12 SNOMED tissue categories • Results from all samples within a tissue category are combined by generating electronic Northerns • For each tissue fragment and each tissue category is determined: • Median expression value • Present call (percentage)
Median expression value • Expression value: intensity • All expression values that have a ‘present call’ are used to determine the median expression value • Varies from 0 to ~65,000 in human and from 0 to ~97,000 in mouse
Present Call (Percentage) • Normalization/scaling procedures (MAS 5.0) are used to determine an expression intensity value with an associated confidence level to each fragment • When confidence level p for expression is smaller than 0.05, the expression intensity for this specific fragment in this particular sample is called present (P) • Call values are used to calculate a present call percentage (P calls / total calls)
Snomed category definitions (1) • SNOMED: Systematised Nomenclature of Medicine • Combines specific categories into more global categories, i.e. organ systems • In human far more useful than in mouse (115->15,25->12) • Categories like: cardiovascular system, digestive organs, endocrine gland, female genital system, male genital system, musculoskeletal system, nervous system, respiratory system, etc.
Snomed category definitions (2) • Example: cat. 7: ‘hematopoietic system’:
Annotation provided For each fragment, if available: • title • geneSymbol • geneAlias • exemplarAcc • omimId • snpId • refseqId • refseqprotId • ncbiNuclId • ncbiProtId • unigeneAcc • unigeneId • interproId • pfamId • swissprotId • goId • goFunction • goProcess • goComponent • comment
Goals & Problems • Goal: use data set to see if co-expression between orthologous/paralogous gene pairs is higher than between ‘unrelated’ gene pairs, in human & mouse • Problem 1: limited annotation • Problem 2: empty expression profiles • Problem 3: size of data set
Limited annotation (1) For example for three of the most used protein ids: ncbiProtId (in red), refseqProtId (in green), swissprotId (in blue) Human: Mouse:
Limited annotation (2) Solutions: • Smith-Waterman of (SWX) of all Affymetrix sequences to the human & mouse IPI sets, for which orthologs and paralogs were already defined -> IPI id added to database • Smith-Waterman (SWN) of all Affymetrix sequences to each other for better orthology/paralogy prediction
Empty expression profiles • Lots of genes have no expression at all in any tissue category • Useless for correlation calculation; two genes with no expression will have a top correlation! • For human: 4114 out of 44792 fragments completely no expression in all tissue categories -> 40678 left • For mouse: 6791 out of 36701 fragments completely no expression in all tissue categories -> 29910 left
Size of data set • Correlation between gene pairs is calculated: the number of pairs is (x2-x)/2 for x genes -> millions of data points • Number of gene pairs is already brought down by the ‘no expression gene removal’: in human from 1,003,139,236 to 827,329,503, in mouse from 673,463,350 to 447,289,095 • For some quick analyses, sets of e.g. 1000 randomly selected genes were used -> 499,500 gene pairs
Uncentered Correlation • ‘Uncentered’: from 0 to 1 • UC(X,Y)= Σ( X / ( sqrt ( Σ( X2 / N ) ) ) * ( Y / ( sqrt ( Σ( Y2 / N ) ) ) ) / N • Calculated correlations between gene pairs were used to see if the co-expression for orthologous pairs and/or paralogous pairs is higher than for ‘unrelated’ pairs • This was measured by using the KEGG Pathway map (release 25) • The best, however not completely convincing, result was found using PCP and not ME:
Correlation KEGG Pathway Check • Data points above a correlation threshold of 0,9 and 1,0 were left out because of very low numbers (irreliability) • Only orthologous conserved gene pairs have a higher accuracy when increasing the correlation threshold • May be a combination of PCP and ME should be used • Another measure could be used: same GO category instead of KEGG, GO is already annotated by Affymetrix • Lots of genes have only an expression value in one tissue; this correlation method is not really suitable -> mutual information analysis
Mutual Information • For each tissue category: 0 or 1 (ME/PCP value below/above a specified threshold) • x0 = % of 0s, x1 = % of 1s • x00 = % of 0-0 pairs, x01 = % of 0-1 pairs, x10 = % of 1-0 pairs, x11 = % of 1-1 pairs • Entropy per gene: -(x0*ln(x0)+x1*ln(x1)) • Entropy per gene pair: -(x00*ln(x00)+x01*ln(x01)+x10*ln(x10)+x11*ln(x11)) • MI = Entropy(1) + Entropy(2) – Entropy(1,2) • 0<=MI<=0,693147
MI GO Category Check • Mutual information check using GO Biological Process 3rd level of specification • Horizontal axis shows log(MI) • Different lines: different thresholds for defining as a ‘0’ or a ‘1’ • Accuracy indeed seems to be higher for pairs with much mutual information, but there is also a peak at -9<=log(MI)<-8 • Orthologous/paralogous pairs not checked yet
Future plans • Complete mutual information analysis, using both KEGG Pathway and GO databases; look at orthologous and paralogous gene pairs too • Check alternative splicing • Speed – license ends at the end of June