Functional genomics approaches to disease genomics • Biological information and organisation • Genomics approaches to identifying disease-relevant enrichment • Candidate gene approaches
Biological information increases rapidly • Every day, hundreds of articles are published • We can’t read them all • We can’t remember them all • Our memories are subjective anyway • To make use of this incredible research output, we need ways to bring this information together and summarise it • If we can make it readable by a computer, our power to use it increases hugely
OMIM • Online Mendelian Inheritance in Man (OMIM) is a catalog of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases • Annotates 325 genes associated with human disease • 2,710 disorders with a known molecular basis • 1,634 genetic disorders with an unknown basis • The OMIM entries are made by experienced annotators • Even the best annotators are not wholly consistent
What is Ontology? • Dictionary: A branch of metaphysics concerned with the nature and relations of being. • Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality. [Figure: timeline marked 1606 and 1700s] Slide from the GO website www.geneontology.org
Ontologies • Formalising our knowledge into a structured and defined vocabulary is essential for genomics approaches • The benefits from an agreed language enable rapid progress (e.g. Species classification) • Recently, biological research communities have been defining a common language for describing everything from protein function through to phenotype
From a practical view, ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things (for example, “is part of”). Slide taken from GO (www.geneontology.org)
Gene Ontology (GO) • The Gene Ontology project was set up to provide a controlled vocabulary that describes a gene and its products (principally its product) • GO describes genes in 3 separate ontologies • Molecular function, biological process and cellular component • Genes can be annotated with many terms in each category
Molecular Function • GO term: malate dehydrogenase • GO id: GO:0030060 • (S)-malate + NAD(+) = oxaloacetate + NADH. Cellular Component • GO term: mitochondrion • GO id: GO:0005739 • Definition: A semiautonomous, self-replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration. Biological Process • GO term: tricarboxylic acid cycle • Synonym: Krebs cycle • Synonym: citric acid cycle • GO id: GO:0006099
GO Biological Process • Directed Acyclic Graph (DAG) • Allows a child node to have more than one parent [Figure: is_a hierarchy linking Physiological Process, Metabolism, Primary Metabolism, Biosynthesis, Protein Metabolism and Protein Biosynthesis]
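Because a term in a DAG can have more than one parent, an annotation to a specific term implies annotation to every ancestor along every path. The sketch below shows that propagation on a toy graph. The term names come from the slide, but the exact edges are an assumption, since the original figure is not recoverable, and `IS_A` and `ancestors` are illustrative names rather than part of any GO tool.

```python
# Toy is_a graph: each term maps to its direct parents.
# Edges are assumed for illustration; only the term names come from the slide.
IS_A = {
    "protein biosynthesis": ["protein metabolism", "biosynthesis"],  # two parents
    "protein metabolism": ["primary metabolism"],
    "primary metabolism": ["metabolism"],
    "biosynthesis": ["metabolism"],
    "metabolism": ["physiological process"],
    "physiological process": ["biological process"],
    "biological process": [],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


if __name__ == "__main__":
    for name in sorted(ancestors("protein biosynthesis")):
        print(name)
```

A gene annotated with "protein biosynthesis" would therefore also count towards "biosynthesis" and "protein metabolism" when testing for enrichment at those more general terms.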
Mammalian Phenotype Ontology • Really the mouse phenotype ontology • Annotators take each published mouse gene knock-out experiment and annotate the phenotype with the MPO
Human Medical Ontologies • Human Phenotype Ontology www.human-phenotype-ontology.org • The HPO provides a standardized vocabulary of phenotypic abnormalities encountered in human genetic syndromes • London Dysmorphology Database www.human-phenotype-ontology.org [Figure: HPO hierarchy containing the terms Organ abnormality, Cardiovascular abnormality, Cardiac abnormality, Cardiac malformation, Abn. of the cardiac septa and Abn. of the cardiac atria]
Model Organisms • Excellent functional genomics resources • The comparison between a human phenotype and a mouse phenotype is often very readily interpretable. • Other useful organisms include the fly, the worm and even yeast • Useful as they have well-curated data for many genes
Kyoto Encyclopaedia of Genes and Genomes (KEGG) • Pathway database • manually-curated information from literature
High-throughput functional resources • Tissue expression • Where and when genes are expressed may be relevant to the disease • Interactions • Genes that interact may be involved in the same biological process • E.g. protein-protein interactions or genetic interactions (coordinated regulation) • Sequence patterns (coding or regulatory) • Similar sequences can imply common functionality
Different data sources have different types of error • Literature sources (GO, model organism data, etc.) have poor coverage and a lack of true negatives • We publish “A is an X” more than “A is not a Y” • Not all genes have been subject to the same studies • High-throughput sources often have high error rates • False positives are a particular problem for gene/protein interactions when you’re considering all pairs
The value of mouse phenotypic data Ability to predict Human Phenotype Ontology terms
Forming interesting gene sets • If you can’t identify a single gene/locus, maybe you can form a subset of genes likely to contain the gene(s) of interest • Genes in large intervals identified by linkage studies • Genes near SNPs with low, but not genome-wide significant, p-values from GWAS • Genes in de novo or rare CNVs seen in cases • Power is important • Bringing together many similar cases enriches for genes associated with that disease
Testing for enrichments • Compare to the genome • Pulling balls (genes) from a bag (genome) is sampling without replacement, which follows the hypergeometric distribution • Compare to controls • If chosen well, may account for biases • Contingency tables, χ² tests • If controls are unavailable, you can randomise to help address potential biases like gene length and function
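The "balls from a bag" comparison against the genome can be sketched as a one-sided hypergeometric tail probability. This is a minimal illustration using only the standard library; the function name and the example numbers are hypothetical and are not taken from the studies in these slides.

```python
from math import comb


def hypergeom_enrichment_p(genome_size, annotated_in_genome, set_size, annotated_in_set):
    """P(X >= annotated_in_set) when set_size genes are drawn without replacement.

    genome_size          total genes in the bag
    annotated_in_genome  genes in the bag carrying the annotation
    set_size             genes drawn (the gene set of interest)
    annotated_in_set     annotated genes observed in the draw
    """
    total = comb(genome_size, set_size)
    upper = min(annotated_in_genome, set_size)
    tail = sum(
        comb(annotated_in_genome, k) * comb(genome_size - annotated_in_genome, set_size - k)
        for k in range(annotated_in_set, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes, 400 annotated, 100 drawn, 8 annotated seen.
    print(hypergeom_enrichment_p(20000, 400, 100, 8))
```

This test treats every gene as equally likely to be drawn, which is the assumption that well-chosen controls or randomisation are meant to relax.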
Rare de novo copy number variant (CNV) associated with learning disability [Figure: a 2.8 Mb CNV] How does this CNV relate to the etiology of the disease? Which gene(s) underlie the phenotype?
Rare de novo CNVs are frequent in learning disability Collect a list of 148 rare de novo CNVs • Rare de novo CNVs > 100kb present in ~10% of LD cases • Occur all over genome • 80% unique, non-recurrent
CNVs are common in all people • Collect a list of 26,472 benign CNVs (Redon et al., Nature 2006) • Apparently benign, mostly inherited CNVs occur all over the genome
Mutations at different loci can give a similar phenotype SYMPTOM/PHENOTYPE
Method [Figure: workflow. Interesting intervals in patients → human genes → (orthology) → mouse genes → available mouse KO phenotypes → significantly over-represented phenotype → mouse models relevant to the human disorder / disease phenotype]
Significant enrichments of genes associated with particular mouse phenotypes within de novo CNVs identified in patients with intellectual disability [Figure: bar charts of % change over expected for the mouse phenotypes Abnormal dopaminergic neuron morphology, Abnormal axon morphology and the Nervous System category, across five CNV sets: Benign CNVs, All LD CNVs, LD CNVs - benign CNVs, Loss LD CNVs, Loss LD CNVs - benign CNVs; * marks FDR < 5%]
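When many phenotypes are tested at once, the raw p-values need correcting before an "FDR < 5%" threshold can be applied. The slides do not say which procedure was used; the sketch below assumes the Benjamini-Hochberg step-up method, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four phenotype terms.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(p, q, "significant" if q < 0.05 else "not significant")
```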
Human brain-specific genes corroborate mouse findings • “Brain-specific” genes are defined as those whose expression in human whole brain is > 4 x median expression across all other tissues • This classes ~3.75% of human genes as “brain-specific” [Figure: enrichment of brain-specific genes in Benign CNVs, All LD CNVs, All LD CNVs minus benign CNVs, Loss LD CNVs and Loss LD CNVs minus benign CNVs; asterisks flag individual sets]
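The brain-specificity rule on this slide is simple enough to write down directly. The sketch below assumes each gene has one expression value per tissue held in a dictionary; the tissue names, values and function name are hypothetical.

```python
from statistics import median


def is_tissue_specific(expression, tissue="whole brain", fold=4.0):
    """True if expression in `tissue` exceeds `fold` x the median of all other tissues."""
    others = [value for name, value in expression.items() if name != tissue]
    return expression[tissue] > fold * median(others)


if __name__ == "__main__":
    # Hypothetical expression values for one gene.
    gene_expression = {"whole brain": 50.0, "liver": 10.0, "heart": 12.0, "kidney": 8.0}
    print(is_tissue_specific(gene_expression))
```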
Autism Spectrum Disorders – the ‘triad’ of symptoms Impaired communication Impaired social interaction Restrictive, repetitive behaviours and interests Autism.org.uk
Behavioural model phenotypes associated with Autism Spectrum Disorder (ASD) de novo CNVs “Difficulty processing and retaining verbal information” “Difficulty understanding social language” “Difficulty coping with changes in routine”
Behavioural model phenotypes associated with Autism Spectrum Disorder (ASD) de novo CNVs “Difficulty understanding social language” “Difficulty with empathy and friendships”
Behavioural model phenotypes associated with ASD de novo CNVs “Restricted and Repetitive Behaviours and Interests” 60-80% of individuals with ASD exhibit poor motor planning and coordination
Candidate genes • The genes that constitute significant enrichments become candidate disease genes • While the enrichment is significantly associated with the intervals, the individual genes are not, and each requires further proof individually • Experimental follow-up is costly, so the genes taken forward need to be considered carefully
Annotations vary in coverage and specificity [Figure: for each annotation source (Mouse phenotypes: Abnormal Axon/Neuron; GO: Transcription; Brain-Specific; KEGG Neuro; KEGG Parkinson’s), the number of candidate genes, the % change over expected, and the % of CNVs with a candidate gene]
The better the patients are classified, the more power we have to identify enrichments [Figure: enrichment panels for the tremor phenotype (patients +/- seizures), the abnormal myelination phenotype (patients +/- brain abnormality), and the KO phenotype cleft palate (LD CNVs in 6 patients with cleft palate vs 142 without), each shown alongside benign CNVs] • 6 of 148 LD patients have a cleft palate
Some associations found for the main cohort may be more relevant to associated, or co-occurring, symptoms – ASD
Mutation databases are a rich source of discovery: DECIPHER • DECIPHER is a database that holds genetic information about patients who present with congenital abnormalities [Figure: Proband 1, Proband 2 and Proband 3, with a very similar phenotype, converging on a single gene]
DECIPHER patients are annotated with London Medical Database terms [Figure: term hierarchy with Level 1, Level 2 and Level 3]
Forming gene sets per human phenotype (example group: Cranium, General abnormalities) • Form groups of CNVs associated with each human phenotype • Assign ENSEMBL genes to CNVs • Remove copy number variable genes observed in healthy individuals [Figure: flow diagram with example counts: 7, 121, 18 and 132 CNVs; 692, 3320 and 3036 genes after assignment; 633, 3030 and 2767 genes after filtering]
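The gene-assignment and filtering steps above can be sketched as an interval-overlap lookup followed by a set difference. The coordinates and gene names below are hypothetical, and counting any partial overlap as "assigned" is an assumption; the slides do not state the overlap rule that was used.

```python
def genes_in_cnvs(cnvs, genes):
    """Return names of genes that overlap any CNV on the same chromosome.

    cnvs   list of (chromosome, start, end)
    genes  dict of gene name -> (chromosome, start, end)
    """
    hits = set()
    for chrom, cnv_start, cnv_end in cnvs:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two closed intervals overlap if each starts before the other ends.
            if g_chrom == chrom and g_start <= cnv_end and cnv_start <= g_end:
                hits.add(name)
    return hits


# Hypothetical gene annotation and CNV calls.
genes = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 500, 600),
    "GENE_C": ("chr2", 100, 200),
}
patient_cnvs = [("chr1", 150, 550)]
benign_cnvs = [("chr1", 580, 700)]

# Genes hit in patients, minus genes seen as copy number variable in healthy people.
candidates = genes_in_cnvs(patient_cnvs, genes) - genes_in_cnvs(benign_cnvs, genes)

if __name__ == "__main__":
    print(sorted(candidates))
```

The nested loop is quadratic and fine for a sketch; a genome-scale run would normally use sorted intervals or an interval tree.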
Many enrichments are readily interpretable • Human Symptom: Short Stature, Prenatal Onset / Mouse Phenotype: Decreased Fetal Size • Human Symptom: Cupid bow shape of mouth / Mouse Phenotype: Abnormal Palate Development • Human Symptom: Malocclusion / Mouse Phenotype: Malocclusion • Human Symptom: Syndactyly of toes / Mouse Phenotype: Syndactyly [Figure: bars for Gain, Loss and All CNVs; * Statistically Significant, FDR < 0.05]
Others identify less obvious relationships • Human Symptom: Psychotic Behaviour / Mouse Phenotype: Abnormal pre-pulse inhibition • Human Symptom: Complex Partial Seizures / Mouse Phenotype: Abnormal circadian rhythm [Figure: bars for Gain, Loss and All CNVs; * Statistically Significant, FDR < 0.05]
Mutations can be dissected to identify the contributions of individual genes [Figure: two patient CNVs with labelled candidate genes. Patient id: 248772: ATG7, OXTR, ATP2B2 (intellectual disability/developmental delay candidate genes); FANCD2 (short stature, prenatal onset candidate gene). Patient id: 785: labels, in source order, Camptodactyly candidate gene, SNX2, Mental retardation/developmental delay candidate gene, FBN2]
Gene set enrichment analysis (Subramanian et al., 2005) • Start with some list of ranked genes • Genes ranked by expression in cases vs controls (microarrays) • Genes ranked by nearby SNP p-values • Score genes + or – according to some property • Ask: are genes with this property more concentrated towards the top of this list than I would expect by chance?
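The question on this slide can be sketched as a running-sum statistic over the ranked list. This is a simplified, unweighted version: every hit takes an equal step up and every miss an equal step down, and significance comes from shuffling the gene order. The published method weights steps by the ranking metric and typically permutes sample labels, so treat this as an illustration only; all names and data are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score: the largest deviation from zero along the list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of random gene orders scoring at least as high as the observed order."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)


if __name__ == "__main__":
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
    print(enrichment_score(ranked, {"g1", "g2"}), permutation_p(ranked, {"g1", "g2"}))
```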
Gene Prioritisation for disease • Given a list of genes, which are most likely to be involved in this disease? • We just want a ranking, not a significant association • Commonly employed approaches involve supervised learning methodologies • Collect data points from one or more sources • Take a “Gold Standard” set of genes for this disease • Train a method using known true positives (and true negatives, if known) • Given a list of genes, which ones “look” most similar to the known disease genes?
Linkage networks can infer missing values – “guilt by association”
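One simple form of "guilt by association" is neighbour voting: a gene with no annotation of its own is scored by the fraction of its network neighbours that are already known disease genes. The sketch below is one possible scoring rule, not the method of any specific paper cited in these slides; the network and gene names are hypothetical.

```python
def neighbour_vote(network, known_genes):
    """Rank genes outside known_genes by the fraction of their neighbours that are known.

    network      dict of gene -> list of neighbouring genes
    known_genes  set of genes already associated with the disease
    """
    scores = {}
    for gene, neighbours in network.items():
        if gene in known_genes or not neighbours:
            continue
        scores[gene] = sum(n in known_genes for n in neighbours) / len(neighbours)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical network and known disease genes.
network = {
    "G1": ["G2", "G3"],
    "G2": ["G1", "G3", "G4"],
    "G3": ["G1", "G2"],
    "G4": ["G2"],
    "G5": ["G4"],
}
known = {"G1", "G3"}
ranking = neighbour_vote(network, known)

if __name__ == "__main__":
    for gene, score in ranking:
        print(gene, round(score, 3))
```

Because well-studied genes tend to have more recorded interactions, a score like this inherits the literature bias noted earlier for curated sources.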
Linkage network for human disorders using the Human Phenotype Ontology (PMID 18950739)
Conserved co-expression of disease genes (Ala et al., PLoS Genetics 2008) • 850 OMIM entries where a phenotype was mapped to a locus but the specific genes are unknown • Used conserved human-mouse co-expression data, as other interaction or pathway data can bias towards studied genes • Generated single-species gene co-expression networks • Calculated Pearson’s correlation coefficient between all pairs of gene expression profiles. Formed a network edge if the 2 genes’ expression correlation was in the top 1% for either gene. • Clustered OMIM phenotypes using MimMiner • A text-mining tool
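The edge rule described above (correlate all pairs, then keep an edge if the partner ranks in the top 1% for either gene) can be sketched with the standard library. The expression profiles are hypothetical, the function names are illustrative, and with so few toy genes the top fraction rounds up to a single best partner per gene.

```python
from math import ceil, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, top_fraction=0.01):
    """Link two genes if either one ranks in the other's top fraction by correlation."""
    genes = list(profiles)
    k = max(1, ceil(top_fraction * (len(genes) - 1)))  # partners kept per gene
    edges = set()
    for gene in genes:
        partners = sorted(
            (other for other in genes if other != gene),
            key=lambda other: pearson(profiles[gene], profiles[other]),
            reverse=True,
        )
        for other in partners[:k]:
            edges.add(frozenset((gene, other)))  # undirected edge
    return edges


if __name__ == "__main__":
    # Hypothetical expression profiles across four samples.
    toy = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5], "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
    for edge in coexpression_edges(toy):
        print(sorted(edge))
```

In the conserved co-expression setting, a network like this would be built separately for human and mouse and only edges present in both kept; that intersection step is not shown here.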
Using this methodology, they were able to predict 321 candidates across 81 disease-associated loci at an FDR of <10%
Human phenome-interactome network for predicting disease candidate genes (Lage et al., Nature Biotech. 2007) • 2 data networks • Phenotypic similarity, based on detecting words that are common to two phenotype descriptions and do not occur frequently among all phenotype descriptions. • Human interactome, consisting of several large human sets and sets transferred from model organisms, weighted according to observation frequency.
• A given positional candidate is queried for high-scoring interaction partners (“virtual pull-down”). These are the interaction partners for the candidate complex. • Proteins known to be involved in disease are identified in the candidate complex, and pairwise scores of the phenotypic overlap between the diseases of these proteins and the candidate phenotype are assigned. • Based on the phenotypes represented in the candidate complex, a Bayesian predictor awards a probability to the candidate in the complex. The score is used to form the ranking.
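The phenotypic similarity idea (shared words count for more when they are rare across all phenotype descriptions) can be sketched with an inverse-document-frequency weight. This weighting is an assumption chosen for illustration and is not the scoring scheme from the cited paper; the descriptions and function name are hypothetical.

```python
from math import log


def phenotype_similarity(desc_a, desc_b, all_descriptions):
    """Sum of rarity weights over the words shared by two phenotype descriptions."""
    docs = [set(text.lower().split()) for text in all_descriptions]
    shared = set(desc_a.lower().split()) & set(desc_b.lower().split())
    score = 0.0
    for word in shared:
        doc_freq = sum(word in doc for doc in docs)
        if doc_freq:
            # Words found in every description get weight log(1) = 0.
            score += log(len(docs) / doc_freq)
    return score


# Hypothetical phenotype descriptions.
descriptions = [
    "short stature and seizures",
    "seizures and cleft palate",
    "cleft palate and hearing loss",
]

if __name__ == "__main__":
    print(phenotype_similarity(descriptions[0], descriptions[1], descriptions))
    print(phenotype_similarity(descriptions[0], descriptions[2], descriptions))
```

The first pair shares the informative word "seizures"; the second pair shares only "and", which appears everywhere and so contributes nothing.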