Mining SNPs from EST Databases

Mining SNPs from EST Databases Picoult-Newberg et al. (1999)

Strategy to Identify SNPs • Used ESTs derived from 19 different cDNA libraries to assemble 300,000 distinct sequences and identified 850 mismatches. • ESTs are short single pass cDNA sequences generated from randomly selected library clones. • EST contigs were generated using Phrap, a sequence alignment and contig assembly program. (Picture of EST alignment) • To distinguish between true SNPs and artifacts within the cDNA, a series of filters were applied.

Filter Types Used • Filter Test: 100 contigs were randomly selected and inspected with Consed after filtration. Candidate SNPs were considered real if trace data was of high quality and passed all four filters. • Filter 1: Eliminates clusters of mismatches that often occur in regions of low-quality trace data. • Filter 2: Identifies sequence mismatches by either base substitution type or insertion/deletion type. • Filter 3 & 4: Addressed quality of each base call relative to its position and frequency in a contig. First 100 bases discarded. Consed view of a contig containing a high quality mismatch (A vs. T). The mismatch has been confirmed as a common SNP.

Filter Types Used • Focused primarily on base substitutions (A/G or T/C) as these types make up greater than 60% of all polymorphisms

Conclusions • Successfully demonstrated a strategy for the rapid identification and verification of SNP-based genetic markers using EST data sources. • Possible that this approach could identify sequence variants that lead to amino acid substitutions that may lead to functional differences. • To show effectiveness of this strategy, the throughput of SNP confirmation was increased by using Genetic Bit Analysis (GBA).

Overlapping Genomic Sequences: A Treasure Trove of Single-Nucleotide Polymorphisms Taillon-Miller et al. (1998)

SNPs in Overlapping Clones • Developed strategy to identify SNPs in overlapping BAC clones. (Picture) • Sequenced three sets of overlapping clones, 153 polymorphisms (1 per 1.3 kb) were discovered to be unique (substitution, insertion or deletion). • 55 were discarded by computer analysis (filters). • The oligonucleotide selection program (osp) was used to design primers to amplify the 98 remaining SNPs. • 30 SNPs were in regions with no suitable amplimers. • In all, 44 STSs were developed to amplify the remaining 68 SNPs, with 16 STSs containing 2 SNPs and 8 STSs contained 3 or more SNPs. All 68 SNPs were present in at least one of three populations studied.

SNPs in Overlapping Clones 30 individuals

Conclusions • Informative SNPs can be found by using overlapping regions of clones because BACs are typically from different individuals. • They described a marker every 4.8 kb. • Many of the SNPs have different population frequencies • This approach has many advantages because 1) high quality sequence data because every base in overlap is sequenced at least twice; 2) SNP data is generated by analyzing existing data; 3) SNPs are derived from long range sequence data and these markers are precisely mapped. • SNP markers can be found in overlapping genomic sequence, it is highly efficient and cost effective. Only two steps involved (develop STSs around known SNPs and characterize them in populations).

An SNP map of the human genome generated by reduced representation shotgun sequencing Taillon-Miller et al. (1998)

Reduced Representation Shotgun Sequencing • RRS re-samples specific subsets of the genome and compares the resulting sequences using a highly accurate SNP detection algorithm. • Prepare subsets of the genome (reduced representation), by performing restriction digest to generate restriction fragment lengths between 500 to 600 bp. (Picture) • Computer analysis of 517 megabases yielded 3847 BglII fragments. • Thus SNPs can be discovered by mixing DNA from many individuals, prepare library of sized restriction fragments, and randomly sequencing clones. • Defined a neighborhood quality standard (NQS) to cut down on base calling errors. Good sequence results in higher accuracy.

Reduced Representation Shotgun Sequencing • Used polymorphism free BAC DNA to evaluate NQS. Any ‘SNPs’ would represent base calling errors. • Base calls within ‘good neighbors’ were more accurate than predicted by Phred.

Conclusions • Demonstrated a new method for re-sampling loci without PCR, and for detecting SNPs with high accuracy in the resulting alignments. • RRS can be configured for any organism. • SNPs discovered by RRS offer the potential for reduced representation genotyping without locus specific amplification.

Mining SNPs from EST Databases

Mining SNPs from EST Databases

Presentation Transcript

SNP Resources: Finding SNPs Discovery and Databases

Mining Multiple Private Databases

Efficiently Mining Long Patterns from Databases

SNPs

SNP Resources: Finding SNPs, Databases and Data Extraction

SNPs

Integrating Mining With Databases

Mining Association Rules in Large Databases

SNPS assessed

Imputing HLA Alleles from SNPs

SNPs

SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

Mining Multidimensional Databases

SNP Resources: Finding SNPs, Databases and Data Extraction

Mining Time-Series Databases

Mining Frequent Itemsets over Uncertain Databases

SNPs

SNP Resources: Finding SNPs Databases and Data Extraction

Mining Multidimensional Databases

TreeGeneBrowser: phylogenetic data mining of gene sequences from public databases