National Genetic Trait Index Update on grape pilot project

Next-Generation sequencing to sample diversity • Genotyping the germplasm collection • Doreen Ware, USDA ARS • NGWI, April 27, 2009

Outline • Background on the National Genetic Trait Index (NGTI) • Grape Project objectives • Step 1: Next-Generation Sequencing to sample diversity • DNA preparation, sequencing method and analysis of sequencing reads for variation • Characterization of SNPs: position, allele support, and coverage • 10k SNP array development • Step 2: Genotyping the germplasm collection • SNP identification • Preliminary results of the array • Phenotyping

What are germplasm collections? • The culmination of thousands of years of selection and improvement of plants • Our richest genetic heritage • The central resource for feeding and fueling the world • A resource from the past that we must pass on to the future in an improved state

What do we currently know? • Multiple functional variants per gene = alleles • 20,000 to 50,000 genes • Most traits product of 100s of genes • Many possible genetic combinations • Over the last 10,000 years, we have tested only a limited set of genetic combinations • Need a rational plan to organize and use this diversity • Genetics and Breeding

What do we want to do? We want to make more useful plants by conserving, finding and combining better alleles. The National Plant Germplasm System conserves 464,000 accessions and may contain 100,000,000 distinct alleles, but there is no index.

Stakeholder View

[Figure: Absolute View. Allele scores for Yield (CA), Disease Resistance and Flavor in the current variety (DGL 2343) and in available germplasm (PI 265443, PI 532443, PI 783472, PI 572811). Legend: bad allele, good allele, neutral allele. Annotations: "Although poor yielding, it has complementary yield alleles and good disease resistance and flavor"; "Although good yielding, the current line already captures these alleles".]

[Figure: Contrast View. Allele scores for Yield (CA), Disease Resistance and Flavor in the current variety (DGL 2343) and in available germplasm (PI 265443, PI 532443, PI 783472, PI 572811). Legend: bad allele, good allele, neutral allele.]

[Figure: Absolute View with Optimal Result. Allele scores for Yield (CA), Disease Resistance and Flavor in the current variety (DGL 2343) and in available germplasm (PI 265443, PI 532443), combined into an Optimal Result (+7, +7, +6). Legend: bad allele, good allele, neutral allele.]

Impact

Identify our most important and representative germplasm • Focus curators, security, and breeding efforts on the most important germplasm • We would know what is genetically feasible with natural variation • Biosecurity • Rapidly respond to pathogen introductions • Identify novel alleles and facilitate marker-assisted breeding • Accelerate breeding results • Make US Agriculture Competitive and Open New Markets

NGTI Grape Germplasm • Pilot project to demonstrate the feasibility of genotyping diverse NPGS germplasm collection for a species with more limited genomic resources • Provide markers for improved curation of the grape collection and help breeders and geneticists unleash the genetic diversity of grapes

Grape • Contains over 60 species mostly found in temperate regions of the northern hemisphere • Vitis vinifera is the most important domesticated species cultivated for table grapes and wine making • The wild grape Vitis sylvestris is considered the progenitor of the domesticated grape • High nucleotide diversity (π=0.004), highly heterozygous and low LD (~200bp)

Genetic Diversity in the Domesticated Grape [Figure: examples of genetic diversity in cluster density, cluster size, berry shape and berry size]

Grape Diversity Project • Identify SNPs using a high-throughput sequencing approach • Select 10,000 informative SNPs and establish a genotyping chip for: • Genotyping the USDA grape germplasm repository • 1200 Vitis vinifera + 1000 wild Vitis samples • Study the population structure of Vitis • Patterns of shared polymorphism between Vitis vinifera and wild species • Create a preliminary SNP panel for association studies • Pilot project for developing informatics resources for SNP discovery in other high-diversity crop species

Team • Edward Buckler and Sean Myles – Genomics and statistical analysis • Doreen Ware, Jer-Ming Chia, Bonnie Hurwitz – Bioinformatics • Charles Simon, Gan-Yuan Zhong, Mallikarjuna Aradhya, Bernard Prins – Germplasm • Leon Kochian – Oversight

National Genetic Trait Index Project: Grapevine Step 1: Discovery of genetic variants (SNPs) • Diverse samples: 10 cultivated Vitis varieties (Vitis vinifera) and 6 wild Vitis species • Genome complexity reduction: digestion with HpaII restriction enzyme • Illumina/Solexa sequencing: sequencing by synthesis • 60 million sequences; total: 2 billion base pairs of sequence • Discovery of >1 million SNPs • Make data available: integrate SNP data into public grape genome browser

SNP Discovery Panel • Goal: Capture recent variation in domesticated grape as well as more ancient alleles in wild species • Solexa libraries constructed from 10 domesticated cultivars and 6 wild species • Domesticated cultivars: Ehrenfelser, French Colombard, Gewurztraminer, Kadarka, Malvasia, Muscat of Alexandria, Pinot Noir, Plavac Mali, Thompson Seedless, White Riesling • Wild species: Vitis amurensis, Vitis cinerea, Vitis labrusca, Vitis palmata, Vitis rotundifolia, Vitis sylvestris • Reference genome: inbred Pinot Noir

Library Construction Protocol: Reducing the Complexity of the Genome • DNA extraction • Whole genome amplification* • Genome complexity reduction: restriction enzyme digest • Addition of 'A' base to 3' ends • Ligation of Solexa adaptors • Size selection from gel: 100-600bp • Solexa Genome Analyzer

Reduced Representation Libraries [Figure: a genomic sequence with three HpaII sites is digested into four fragments, which go to Solexa sequencing] • Sequence: ACTATCTATCCGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCGGTCATCGATTAGCCTAGCTCGATCGCTTACCGGTAGGACTGCTTCGA • Fragments: ACTATCTATCCG; CGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCG; CGGTCATCGATTAGCCTAGCTCGATCGCTTACCG; CGGTAGGACTGCTTCGA
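The fragment layout on this slide can be reproduced with a small in-silico digest. This is a minimal sketch, not the project's code: the function name `hpaii_fragments` is hypothetical, and it assumes that HpaII cuts C^CGG and that the 2-base CG overhang is filled in afterwards, which is what the overlapping fragment ends shown on the slide suggest.

```python
import re

# HpaII recognition site; the enzyme cuts after the first C (C^CGG).
HPAII_SITE = "CCGG"

# Example sequence copied from the slide.
SLIDE_SEQ = (
    "ACTATCTATCCGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCGGTCATCGATTAGCC"
    "TAGCTCGATCGCTTACCGGTAGGACTGCTTCGA"
)


def hpaii_fragments(seq):
    """Return top-strand fragments after an HpaII digest.

    Assumption: the CG overhang is filled in, so neighbouring
    fragments share the two bases 'CG' at the cut site.
    """
    fragments = []
    left = 0
    for match in re.finditer(HPAII_SITE, seq):
        site = match.start()
        # Left fragment runs through C + filled-in CG (ends in CCG).
        fragments.append(seq[left:site + 3])
        # Right fragment starts at the CGG of the same site.
        left = site + 1
    fragments.append(seq[left:])
    return fragments


if __name__ == "__main__":
    for fragment in hpaii_fragments(SLIDE_SEQ):
        print(len(fragment), fragment)
```

Fragments produced this way could then be filtered by length to mimic the 100-600bp gel size selection in the library protocol; the toy fragments on the slide are all shorter than that.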

Next-Generation Sequence Analysis Workflow [Flowchart] • Image files from Solexa GA • Base calling (Firecrest, Bustard) • Sequence and base quality • Read mapping: ungapped alignment • Mapped to genome? YES: alignments; NO: gapped alignment • Data storage: sequence and base quality, alignments • Aln consensus & quality • Variation discovery • Accessibility filters • Called SNPs • Variation data

Building a Pipeline • Modular components allow for different mapping strategies • Mapping only non-redundant, non-singleton reads • Gapped vs. un-gapped alignment • Customizable SNP Filtering • Quality and probability filters • Read coverage assessed by technical or biological replicates • Tighter Data Control • Interim data, procedural and analysis results are stored • Allows for easy rollbacks and efficient re-analysis of the data using different parameters • Incremental data can be added and analyzed without having to re-run the entire pipeline

Deciphering Genetic Diversity From High-Throughput Sequencing

Variation Discovery • Retrieved reads that carried a variant allele as compared to the reference genome • Initial pass for variation has very loose thresholds • Minor allele frequency >= 0.05 • Allele frequency is approximated by read counts • Bi-allelic • An allele appearing at > 2% frequency is considered informative; a SNP with 3 or more informative alleles is considered non-bi-allelic • Total read count for the SNP >= 10 but <= 1000 • 469,470 potential SNPs
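The loose first-pass thresholds above can be written as a single predicate over per-allele read counts. This is a sketch under stated assumptions rather than the pipeline's actual filter: the function name, the dictionary input, and the choice to count the reference allele among the "informative" alleles are all illustrative.

```python
def passes_initial_filter(allele_counts, min_maf=0.05, informative=0.02,
                          min_depth=10, max_depth=1000):
    """Apply the first-pass SNP thresholds to read counts at one site.

    allele_counts maps each observed base (reference included) to the
    number of reads supporting it; frequencies are approximated by
    read counts, as on the slide.
    """
    total = sum(allele_counts.values())
    # Total read count must be >= 10 but <= 1000.
    if not (min_depth <= total <= max_depth):
        return False
    freqs = sorted((n / total for n in allele_counts.values()), reverse=True)
    # Three or more alleles above 2% means the site is not bi-allelic.
    if sum(f > informative for f in freqs) >= 3:
        return False
    # A SNP needs a second allele, at a minor allele frequency >= 0.05.
    return len(freqs) >= 2 and freqs[1] >= min_maf


if __name__ == "__main__":
    print(passes_initial_filter({"A": 18, "G": 2}))           # 10% minor allele
    print(passes_initial_filter({"A": 50, "G": 30, "C": 20}))  # three alleles
```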

Selecting 10K SNPs for Array • Selected 10,000 SNPs for constructing an Illumina Infinium assay • On top of SNP quality, also considered: • Segregation patterns: Select SNPs that are supported by homozygous and heterozygous samples • Homozygous criteria: • More than 5 reads supporting the allele • Heterozygosity test: • Simple binomial test applied to the reference and alternate read counts in a single sample • Probe design • Specificity -> minimize cross-hybridization • Took 50bp on both sides of each SNP and matched against genome (blast) • Disregarded the flanking region if it matched another location with < 2 mismatches within the first 10bp and < 5 mismatches in total • Sensitivity • SNPs within the probe sequence might cause the assay to fail, so disregarded the flanking region if another SNP was found within 10bp
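The segregation criteria can be illustrated with an exact binomial test on the reference and alternate read counts of one sample. The slide does not give the test's parameters, so everything specific here is an assumption: an expected reference fraction of 0.5 for a heterozygote, a two-sided test, a cutoff `alpha` of 0.05, and the hypothetical function names.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for the observed read split."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref)**(n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum every outcome that is no more likely than the observed one.
    tail = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, tail)


def classify_sample(ref_reads, alt_reads, min_hom_reads=6, alpha=0.05):
    """Label one sample at one SNP from its read counts.

    Homozygous: more than 5 reads, all supporting a single allele.
    Heterozygous: both alleles seen and the split is consistent with
    50:50 (binomial p >= alpha, an assumed cutoff).
    """
    if alt_reads == 0 and ref_reads >= min_hom_reads:
        return "hom_ref"
    if ref_reads == 0 and alt_reads >= min_hom_reads:
        return "hom_alt"
    if ref_reads > 0 and alt_reads > 0:
        if binomial_two_sided_p(ref_reads, alt_reads) >= alpha:
            return "het"
    return "uncertain"


if __name__ == "__main__":
    for counts in [(8, 0), (0, 7), (6, 5), (19, 1), (3, 0)]:
        print(counts, classify_sample(*counts))
```

A SNP would then be a candidate for the array when the discovery panel contains samples in both the homozygous and the heterozygous classes.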

10K SNPs Consequence within Genomic Sequence • SNP consequences were determined by integrating the SNP calls with the genome annotation through Ensembl • The selected 10K SNPs are enriched for genic SNPs • In contrast, the genome is 46% genic space and 41% repetitive/transposable elements

10K SNPs: Segregation Patterns

Step 2: Genotyping the grape germplasm repository • SNP selection: choose 10,000 high-quality SNPs from the 500,000 Solexa SNPs • 10K SNP chip: production of custom 10,000 (8898) SNP genotyping array • Genotype the germplasm repository: 1200 cultivated samples (Vitis vinifera) and 1000 wild samples • 21 million genotypes • Analyses: establish core germplasm collection; identify synonyms and homonyms; association mapping; estimate population genetic parameters

Genotyping the Collection • The 10K array represents 8898 SNP genotypes • Of these, 5500 SNPs (62%) are showing results • Mean concordance among replicates is 98.8% • 515 accessions • ~192 samples a week; should be complete by the end of July
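For readers unfamiliar with the replicate concordance figure, one common way to compute it is the fraction of SNPs with identical calls across replicate pairs. The sketch below is illustrative only: the slide does not say how concordance was calculated, and the genotype strings, the `"--"` missing-call code and the function names are assumptions.

```python
from itertools import combinations


def replicate_concordance(calls_a, calls_b, missing="--"):
    """Fraction of SNPs with identical calls in two replicates.

    SNPs with a missing call in either replicate are skipped.
    Returns None when no SNP can be compared.
    """
    compared = [(a, b) for a, b in zip(calls_a, calls_b)
                if a != missing and b != missing]
    if not compared:
        return None
    return sum(a == b for a, b in compared) / len(compared)


def mean_concordance(replicates, missing="--"):
    """Mean pairwise concordance over all replicates of one accession."""
    values = [replicate_concordance(a, b, missing)
              for a, b in combinations(replicates, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    rep1 = ["AA", "AG", "GG", "--"]
    rep2 = ["AA", "AG", "AG", "GG"]
    print(replicate_concordance(rep1, rep2))
```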

PCA of array-scored SNPs shows clustering of the different germplasm

PCA is able to discriminate among the wild species

Outcomes • Genotyped Germplasm Collection • GRIN will have a real dataset to work with • Facilitate better curation • Allow breeders to estimate breeding values for entire germplasm collection • Background to initiate detailed phenotypic evaluation of germplasm and understanding genes underlying key traits

Phenotypes • Pilot Phenotyping of Key Secondary Metabolites • Geneva, NY and Davis, CA: Gan-Yuan Zhong

Phenotyping Key Secondary Metabolites of Grapes • Phenotyping the USDA-ARS Vitis collections will be the next critical step for maximizing the value of the current genotyping effort • A pilot project has been initiated for phenotyping key secondary metabolites of the Vitis collections from both Davis, CA and Geneva, NY • About 400 V. vinifera and 200 North American collections will be phenotyped for 50 various phenolics including anthocyanins [Figure: HPLC-DAD chromatograms at 525 nm, 365 nm and 280 nm, profiling anthocyanins (525 nm) and other phenolics in grapes]

Past and Current Work [Photos: sample collection at the grape germplasm repository, Davis, CA; laboratory and analyses]



Mapping statistics of reads from each germplasm sample to the reference Vitis genome

SNP calling protocol [Figure: Solexa reads mapped to the reference sequence, with tracks for the reference sequence, mapped Solexa reads, and variation (allele, frequency, depth); a repetitive region of grape and the SNPs called are marked]

Overview of the Solexa SNP pipeline • 56 million reads (1.8 billion bp) are aligned to the reference genome • The divergence within V. vinifera and with other Vitis is so great that we need to develop other algorithms to map the reads • 1.1 million regions of the genome have potential SNPs, which are statistically evaluated for genotypic basis • 50,000 high-probability SNPs are identified • Empirically validating a small subset of the data • With improved algorithms and increased knowledge of grape diversity, we may be able to extract 100,000s of SNPs

10K SNP Chip

SNP Filtering Using Base Quality • Uneven distribution of SNPs based on position in sequencing read • Tail ends of Solexa reads have higher error rates, contributing to false SNPs • Filtering out SNPs where the average base quality < 20 [Figure: number of reads vs. position on Solexa read at which variant allele is found; series: no quality filter, ave. base qual >= 10, ave. base qual >= 20]
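The base-quality filter amounts to averaging the quality scores of the bases that support the variant allele and dropping SNPs whose average falls below 20. The sketch below assumes the qualities are already available as integer scores per supporting base; the function names, the dictionary layout and the SNP identifiers are hypothetical.

```python
def mean_base_quality(qualities):
    """Average quality of the bases supporting a variant allele."""
    return sum(qualities) / len(qualities)


def filter_by_base_quality(snps, min_avg_qual=20):
    """Keep SNPs whose variant-supporting bases average >= min_avg_qual.

    snps maps a SNP identifier to the list of base qualities observed
    at the variant position, one per supporting read.
    """
    return {
        snp_id: quals
        for snp_id, quals in snps.items()
        if quals and mean_base_quality(quals) >= min_avg_qual
    }


if __name__ == "__main__":
    candidates = {
        "snp_mid_read": [30, 25, 28],  # variant seen mid-read, high quality
        "snp_read_tail": [8, 12, 15],  # variant seen near read ends, low quality
    }
    print(sorted(filter_by_base_quality(candidates)))
```

Because read tails carry lower qualities, a filter of this kind removes many of the false SNPs that cluster at late read positions.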

Characterization of SNPs • Position in the read • Contingency test for allele • Frequency observed in different accessions • Depth of coverage by number of reads

Using Fisher’s Test as a filter • Read counts of a particular SNP represented as a contingency table • Fisher’s exact test used to test the independence of rows and columns in a contingency table; used as a metric to evaluate segregation of alleles [Figure: example contingency tables with Fisher’s exact p-value < 1e-16 and Fisher’s exact p-value = 0.05]
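A worked example may help show how read counts become a p-value. The sketch below computes a two-sided Fisher's exact test for the simplest case, a 2x2 table of two samples by two alleles; the real tables may have more rows, and the table layout, function names and default cutoff of 0.01 are assumptions made for illustration.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Assumed layout: rows are two samples, columns are read counts for
    the reference and the variant allele.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x reference reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def is_high_confidence(a, b, c, d, alpha=0.01):
    """True when the alleles segregate between samples (p <= alpha)."""
    return fisher_exact_2x2(a, b, c, d) <= alpha


if __name__ == "__main__":
    # Alleles cleanly separated between the two samples.
    print(fisher_exact_2x2(10, 0, 0, 10))
    # Same allele mix in both samples, as expected from random errors.
    print(fisher_exact_2x2(5, 5, 5, 5))
```

A small p-value means the variant allele is concentrated in particular samples, which is the pattern expected of a real segregating SNP rather than of sequencing errors spread evenly across samples.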

SNP Filtering using Fisher’s Test • Accept only SNPs with a p-value <= 0.01 as high confidence [Figure: number of reads vs. position on Solexa read at which variant allele is found; series: no filter, p-value <= 0.1, p-value <= 0.05, p-value <= 0.01]

No. of samples vs SNP quality • SNPs backed by reads from more samples (cultivars and wild species) have better quality

Number of reads supporting a SNP vs SNP quality • SNP quality plateaus beyond about 150 supporting reads
