This study presents a comprehensive pipeline for annotating coding non-synonymous SNPs (nsSNPs) using various data sources. With over 9 million SNPs in dbSNP and limited functional annotations, our approach identifies candidate functional SNPs relevant to gene, haplotype, and pathway analyses. By employing rule-based and supervised learning methods, including SVM, we predict the impact of nsSNPs on protein stability, domain interactions, and ligand binding, thus facilitating the discovery of SNPs critical for disease and drug sensitivity studies.
Bioinformatics, April 2005. LS-SNP: Large-scale annotation of coding non-synonymous SNPs based on multiple information sources
Motivation • Over 9 million SNPs in dbSNP, with little functional annotation • nsSNPs are of critical importance for disease and drug sensitivity • Predicting functional SNPs enables targeting of the SNPs to be genotyped in candidate gene studies • Helps identify the causative SNP among SNPs that are in linkage disequilibrium (LD)
Aims • Identify candidate functional SNPs in a • gene • haplotype • pathway • Map nsSNPs onto protein sequences, functional pathways, and comparative structure models
Predictions of SNP function • Predict positions where nsSNPs • rule based: • destabilize proteins • interfere with formation of domain-domain interfaces • affect protein-ligand binding • supervised learning (SVM): • severely affect human health
Methods - pipeline • SNP-protein mapping • Sequence to structure (experimentally derived) • genomic seq, protein seq, protein structure • SNP prediction annotations combine: • rule based • supervised learning (SVM)
SNP Annotations-rule based • destabilizing (Sunyaev et al., 2001) if: • RSA (relative solvent accessibility) < 25% and difference in accessible surface propensities (knowledge-based hydrophobic potentials) > 0.75 • RSA > 50% and difference in accessible surface propensities > 2 • RSA < 25% and charge change • variant involves a proline in a helix
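The four destabilization rules on this slide can be sketched as a small rule checker. This is a minimal illustration, not the LS-SNP implementation: the `Variant` record, its field names, and the function names are hypothetical, and it assumes the propensity difference is supplied as a non-negative magnitude and RSA as a percentage.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant record; field names are illustrative only."""
    rsa: float               # relative solvent accessibility, in percent (0-100)
    propensity_diff: float   # difference in accessible surface propensity (assumed magnitude)
    charge_change: bool      # substitution changes the residue charge
    proline_in_helix: bool   # variant involves a proline in a helix


def destabilizing_rules(v: Variant) -> list[int]:
    """Return the numbers (1-4) of the slide's rules that fire for this variant."""
    fired = []
    if v.rsa < 25 and v.propensity_diff > 0.75:   # rule 1: buried, unfavourable propensity change
        fired.append(1)
    if v.rsa > 50 and v.propensity_diff > 2:      # rule 2: exposed, large propensity change
        fired.append(2)
    if v.rsa < 25 and v.charge_change:            # rule 3: buried charge change
        fired.append(3)
    if v.proline_in_helix:                        # rule 4: proline in a helix
        fired.append(4)
    return fired


def is_destabilizing(v: Variant) -> bool:
    """A variant is flagged if at least one rule fires."""
    return bool(destabilizing_rules(v))
```

Returning the rule numbers rather than a bare boolean keeps the reason for each flag visible, which matters when several information sources are combined later.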
rule based (cont.) • Interference with domain-domain interfaces is predicted if: • any of the 4 rules applies and • the residue is within <=6 Å of an atom in an adjacent domain • An effect on protein-ligand binding is predicted if: • any of the 4 rules applies and • the residue is within <=5 Å of a HETATM • (ligand: not covalently bonded to the protein, not one of the 20 amino acids, and not in a water molecule)
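The two distance criteria can be sketched as follows. This is an illustrative sketch under stated assumptions: atoms are plain `(x, y, z)` tuples in Å, the ligand atoms passed in are assumed to be already filtered to qualifying HETATM records, and `rule_fired` stands for "at least one of the four rules applies". All function and parameter names are hypothetical.

```python
import math

DOMAIN_CUTOFF = 6.0   # Å, cutoff to an atom in an adjacent domain (from the slide)
LIGAND_CUTOFF = 5.0   # Å, cutoff to a ligand HETATM (from the slide)


def within(residue_atoms, other_atoms, cutoff):
    """True if any residue atom lies within `cutoff` Å of any atom in `other_atoms`."""
    return any(
        math.dist(a, b) <= cutoff
        for a in residue_atoms
        for b in other_atoms
    )


def annotate_contacts(rule_fired, residue_atoms, adjacent_domain_atoms, ligand_atoms):
    """Combine the rule result with the two proximity tests.

    `ligand_atoms` is assumed to exclude water, the 20 standard amino acids,
    and anything covalently bonded to the protein.
    """
    return {
        "domain_interface": rule_fired and within(residue_atoms, adjacent_domain_atoms, DOMAIN_CUTOFF),
        "ligand_binding": rule_fired and within(residue_atoms, ligand_atoms, LIGAND_CUTOFF),
    }
```

For real structures a spatial index would avoid the all-pairs loop; the brute-force version is kept here because it mirrors the slide's wording directly.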
SNP Annotations-supervised learning (SVM) • train an SVM to discriminate between monogenic disease nsSNPs from OMIM and neutral SNPs from dbSNP • slide figure labels: measure of strain, chemical similarity
svm – training dataset • 1457 disease-associated variants in Swiss-Prot and OMIM • 2504 neutral variants • neutral according to rules 1-4 • 3-fold cross-validation • train on subsets 1 and 2, test on subset 3 • repeated 10 times
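The cross-validation scheme on this slide can be sketched as a generator of train/test index splits. This is a minimal sketch, assuming the usual reading of k-fold cross-validation in which each of the three subsets serves once as the test set and the whole procedure is reshuffled on each of the 10 repeats; the slide itself only says "train on subsets 1 and 2, test on subset 3". The function name and seed are hypothetical.

```python
import random


def repeated_kfold(n_samples, k=3, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                             # new random partition each repeat
        folds = [indices[i::k] for i in range(k)]        # k near-equal subsets
        for held_out in range(k):
            test = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield train, test


# 1457 disease-associated + 2504 neutral variants, as on the slide
n_variants = 1457 + 2504
splits = list(repeated_kfold(n_variants))
```

With 3 folds and 10 repeats this yields 30 train/test splits, over which accuracy figures and their spread can be averaged.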
svm – training dataset • the absolute value of the SVM score gives the confidence • low-confidence predictions are excluded • accuracy of 80.5% (±0.3%) • false positives 19.7% (±0.2%) • false negatives 18.7% (±0.8%) • 122 rejected for low confidence
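The idea of rejecting low-confidence predictions can be sketched as below. This is an illustration only: the threshold value, the sign convention (positive score means disease-associated), and the rate definitions (false positives as a fraction of true neutrals, false negatives as a fraction of true disease variants) are assumptions, since the slide does not define them.

```python
def evaluate_with_rejection(scores, labels, threshold):
    """Evaluate signed classifier scores, discarding predictions with |score| < threshold.

    scores: signed SVM outputs (assumed: > 0 predicts disease, otherwise neutral)
    labels: +1 for disease-associated, -1 for neutral
    """
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s) >= threshold]
    tp = sum(1 for s, y in kept if s > 0 and y == 1)
    tn = sum(1 for s, y in kept if s <= 0 and y == -1)
    fp = sum(1 for s, y in kept if s > 0 and y == -1)
    fn = sum(1 for s, y in kept if s <= 0 and y == 1)

    def ratio(num, den):
        return num / den if den else None   # undefined when the class is empty

    return {
        "rejected": len(scores) - len(kept),
        "accuracy": ratio(tp + tn, len(kept)),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

Raising the threshold trades coverage for reliability: more variants are left unclassified, but the remaining calls sit further from the decision boundary.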
Results-mapping • SNP to protein mapping • 28,043 (21,255 dbSNP) validated coding nsSNPs • 70,147 (54,048 dbSNP) including non-validated
Results-structure • 13,391 (53%) proteins have modelled domains with equivalent residues • 13,062 (19%) nsSNPs (all) • 8725 (31%) nsSNPs (validated) • 67 nsSNPs appear in more than one protein (alternative splicing)
Results -function • 1886 destabilizing nsSNPs (structural rules 1-4) • 1317 monogenic disease-associated nsSNPs by SVM • comparative models • conservation • sub properties
Web resource: http://alto.compbio.ucsf.edu/LS-SNP/ • SCOP • swissprot • KEGG • UCSC • PDBSUM • MODBASE • filter by KEGG pathway, SNP ID (rs), HUGO, swissprot
genomic seq protein seq
Discussion-data quality • validated/non-validated SNPs? • multiple independent submissions • submitter confirmation • alleles observed in at least 2 chromosomes • submission to HapMap • report non-validated and validated SNPs with option to filter
Discussion -ligands • the local structural environment of each SNP-ligand contact cannot be evaluated by the pipeline • all contacts are reported • some will not be biologically interesting • e.g. a SNP in proximity to glycerol will have no functional effect • but in glycerol kinase, the SNP could be important
Discussion -structural annotations • ModSNP: 4109 structural annotations, 70% sequence identity cutoff • LS-SNP: 13,062 dbSNP rsIDs (4907 validated) with structural annotations, no sequence identity cutoff • instead, a score (0-1) is given based on sequence identity and model assessment (average identity ~28%)
Discussion -structural annotations • ‘…because structure annotations are models, use properties that depend on correct fold assignment and a good target-template alignment, as opposed to atomic-level structural details such as loss of either salt bridges or hydrogen or disulphide bonds.’
Discussion -structural annotations • not possible to model effects such as changes in backbone geometry • or small side chain alterations
Case study-Glutathione S-Transferase • GSTs play a key role in cellular detoxification • domain interface • buried charge change • unfavourable change in accessible surface potential at buried position • conserved in mouse, rat, chicken • combination of information sources builds a convincing case
Caveats • only updated twice a year • dependent on structure (comparative modelling) • allowing predictions without structure data would have increased numbers • no option to add your own SNPs • no indication of which predictors are best • combinations of predictors • domain-domain or ligand binding flagged, but no indication of how damaging this might be • next version will have HapMap SNPs • SVM – monogenic • only chose a small subset of Sunyaev's rules - conservation?