Exploring Human Genome Structural Variations for Disease Understanding

Chapter 6: Structural Variation and Medical Genomics
CS-6293 Bioinformatics
Instructor: Dr. Jianhua Ruan
Presented by: Nesthor Perez

1. Introduction
• Every human has a different genome.
• Each genome carries variants that influence susceptibility to disease.
• Genome-wide association studies (GWAS) have identified common germline DNA variants associated with diabetes, heart disease, and other diseases.
• GWAS explain only a fraction of the heritability of these traits.

1. Introduction
Every person has a different genome sequence, and each person's genome carries its own traits relevant to disease.

1. Introduction
• Cancer genome sequencing studies have identified somatic mutations associated with cancer progression.
• These mutations are very heterogeneous.
• Few mutations are shared between patients.
• This makes it hard to associate specific mutations with the causes of cancer.
• Comprehensive studies must consider all variants, so an individual genome is required for each case.

1. Introduction
• GWAS focus on single-nucleotide polymorphisms (SNPs); every human genome is unique.
• Germline variants have been identified at scales ranging from SNPs to structural variants (SVs).
• Examples of structural variants:
  • Duplications.
  • Deletions.
  • Inversions.
  • Translocations.

1. Introduction
• GWAS then identified common single-nucleotide polymorphisms (SNPs):
  • Common SNPs for common diseases (similarities).
  • Common variants between diseases (differences).
• Main purpose: disease association and cancer genetics studies.
• In the last 5 years, next-generation DNA sequencing technology has become commercially available from companies such as:
  • Illumina
  • Life Technologies
  • Complete Genomics

1. Introduction
[Figure: chromosome components]

1. Introduction
[Figure: variation relative to a reference genome, ranging from SNPs to structural variants]

1. Introduction
In the last 5 years, these companies have developed new sequencing technologies. Consequently, the cost of DNA sequencing has decreased.

1. Introduction
• Consequently, the cost of DNA sequencing has decreased.
• With low-cost sequencing, the study of all variants becomes possible.
• All variants:
  • Germline.
  • Somatic.
  • SNPs (single-nucleotide polymorphisms).
  • SVs (structural variants).
• This paper discusses these sequencing technologies, with emphasis on structural variants (SVs).

2.1 Germline Structural Variation
• A major goal of human genetics: identify the DNA sequence variants that make each individual unique.
• Attempts:
  • Identify common SNPs (HapMap project).
  • Whole-genome sequencing and microarray measurements found common SVs, including:
    • Duplications
    • Deletions
    • Inversions
• SVs have since been linked to:
  • Autism
  • Schizophrenia

2.1 Germline Structural Variation
[Figure: steps in human genetics studies, from identifying common SNPs to finding SVs (duplications, deletions, inversions) through whole-genome sequencing and microarray measurements]

2.2 Somatic Structural Variation
• Cancer is driven by somatic mutations that accumulate during a lifetime: a "micro-evolutionary process".
• Early studies focused on leukemia and lymphoma.
• They identified "recurrent chromosomal rearrangements", which are present in many patients with the same cancer.
• Next-generation DNA sequencing makes it possible to reconstruct how cancer genomes are organized at single-nucleotide resolution.

2.3 Mechanisms of Structural Variation
• Based on the amount of sequence similarity (homology) at the breakpoints of SVs, there are two mechanisms:
  • NHEJ (non-homologous end joining): little or no sequence similarity.
  • NAHR (non-allelic homologous recombination): high sequence similarity.
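
The distinction between the two mechanisms can be illustrated with a minimal sketch. It assumes the sequences flanking the two breakpoints have already been extracted at equal length, compares them position by position without gaps, and applies an arbitrary identity threshold. The function names and the 0.9 cutoff are illustrative assumptions, not values from the chapter.

```python
def flank_identity(left_flank: str, right_flank: str) -> float:
    """Fraction of positions at which two equal-length flanks agree (ungapped)."""
    if not left_flank or len(left_flank) != len(right_flank):
        raise ValueError("flanks must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(left_flank.upper(), right_flank.upper()))
    return matches / len(left_flank)


def guess_mechanism(left_flank: str, right_flank: str, threshold: float = 0.9) -> str:
    """Label a breakpoint pair by homology; the threshold is an assumed cutoff."""
    # High similarity between the flanks suggests NAHR; little or none suggests NHEJ.
    if flank_identity(left_flank, right_flank) >= threshold:
        return "NAHR"
    return "NHEJ"
```

For example, two identical 10 bp flanks would be labeled "NAHR", while flanks that agree at only 2 of 10 positions would be labeled "NHEJ". Real analyses would use gapped alignment over much longer flanks.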

2.3 Mechanisms of Structural Variation
[Figure: cytogenetic techniques, chromosome painting]

2.3 Mechanisms of Structural Variation
[Figure: cytogenetic techniques, fluorescent in situ hybridization (FISH)]

3. Technologies for Measurement of Structural Variation
• SVs are characterized by:
  • Size.
  • Complexity.
• They range from hundreds of nucleotides to large-scale chromosomal rearrangements.
• Cytogenetic techniques:
  • Chromosome painting.
  • Spectral karyotyping (SKY).
  • Fluorescent in situ hybridization (FISH).

3. Technologies for Measurement of Structural Variation
• Large SVs can be observed directly on chromosomes.

3.1 Microarrays
• Microarrays were used for the first genome-wide surveys of structural variation in 2004.
• The technique is known as array comparative genomic hybridization (aCGH).
• The test genome and the reference genome are each labeled with a different fluorescent color.
• Arrays with hundreds of thousands of probes are now available.
• Because individual copy number ratios are subject to experimental error, computational techniques are required to analyze aCGH data.
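
The need for computational analysis of noisy copy number ratios can be sketched as follows. This is a simplified illustration, not a published aCGH method: it converts per-probe test and reference intensities to log2 ratios, smooths them with a running median to suppress single-probe noise, and calls gains and losses with fixed thresholds. The window size and the thresholds of 0.3 and -0.3 are assumptions chosen for the example.

```python
import math
from statistics import median


def log2_ratios(test, reference):
    """Per-probe log2(test / reference) intensity ratio."""
    return [math.log2(t / r) for t, r in zip(test, reference)]


def running_median(values, window=3):
    """Smooth values with a centered running median (shorter windows at the ends)."""
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def call_copy_number(ratios, gain=0.3, loss=-0.3):
    """Label each probe using fixed thresholds (assumed values)."""
    return [
        "gain" if r >= gain else "loss" if r <= loss else "neutral"
        for r in ratios
    ]
```

With probe intensities such as `[100, 100, 160, 100, 100, 150, 150, 150, 100, 100]` against a flat reference of 100, the isolated spike at the third probe is smoothed away, while the run of three elevated probes is still called as a gain. Dedicated segmentation algorithms are used for real data.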

3.1 Microarrays
• aCGH can be used to measure both germline SVs in normal genomes and somatic SVs in cancer genomes.
• aCGH was initially developed for cancer genomics applications.
• aCGH is now also used to detect copy number variants in large numbers of genomes at low cost.
• aCGH limitations:
  • It detects only copy number variants.
  • It requires that genomic probes from the reference genome lie in non-repetitive regions.

3.2 Next-generation DNA Sequencing Technologies
• As DNA sequencing technology has advanced substantially, the cost of sequencing has fallen sharply.
• One limitation is the length of the DNA molecule that can be sequenced.
• Short reads range from 30 to 1000 nucleotides, or base pairs (bp).

3.2 Next-generation DNA Sequencing Technologies
• Some DNA sequencing technologies use a paired-end sequencing protocol to increase the effective read length.
• In earlier Sanger sequencing protocols, the DNA fragment size depended on the cloning vector.
• In next-generation technologies, several techniques have been used to generate paired reads.
• Current techniques produce paired reads from fragments ranging from a few hundred bp to 2-3 kb.

3.2 Next-generation DNA Sequencing Technologies
• Next-generation sequencing technologies have limited read lengths and limited insert sizes in comparison to Sanger sequencing.
• There are two approaches to detecting SVs with next-generation sequencing. The first is:
  • De novo assembly:
    • Sophisticated algorithms reconstruct the genome sequence from overlaps between reads.
    • The resulting human genome assemblies are highly fragmented.
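
Reconstructing a sequence from overlaps between reads can be illustrated with a toy greedy merger. This is a deliberately naive sketch that assumes error-free reads from a single strand; production assemblers use overlap graphs or de Bruijn graphs. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of contigs with the longest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: the assembly stays fragmented
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
```

Reads that chain together through overlaps collapse into a single contig, while reads without sufficient overlap remain as separate contigs. That second case is a small-scale picture of why short-read assemblies come out fragmented.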

3.2 Next-generation DNA Sequencing Technologies
• The second approach to detecting SVs with next-generation sequencing is:
  • Resequencing:
    • Differences are found between an individual genome and a closely related reference genome.
    • These differences are inferred from differences between the aligned reads and the reference sequence.
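
The resequencing idea, reading off differences from aligned reads, can be sketched for the simplest case. The example assumes a read that has already been placed on the reference without gaps, so it only reports single-base mismatches; the function name and 0-based coordinates are illustrative choices.

```python
def differences(reference: str, read: str, start: int):
    """List (position, reference_base, read_base) for each mismatch of an
    ungapped alignment of read at 0-based offset start on reference."""
    segment = reference[start:start + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [
        (start + i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(segment, read))
        if ref_base != read_base
    ]
```

For the reference `ACGTACGTAC` and the read `TACCTA` aligned at offset 3, this reports one difference at position 6, where the reference has G and the read has C. Structural variants need additional evidence beyond single-base mismatches.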

3.2 Next-generation DNA Sequencing Technologies
Advantages: progress from the earlier generation of DNA sequencing to new sequencing technology.
Disadvantages: the length of a DNA molecule that can be sequenced is limited. Today's technologies produce short sequences of DNA, in the range of 30 to 1000 nucleotides.
To increase read length, these DNA sequencing technologies use paired-end or mate-pair protocols.

3.2 Next-generation DNA Sequencing Technologies
There are two approaches to detecting SVs: de novo assembly and resequencing.

3.3 New DNA Sequencing Technologies
• Previous DNA sequencing technologies face several limitations.
• For example:
  • SV breakpoints in highly repetitive sequences.
• Third-generation and single-molecule technologies offer additional advantages for SV detection:
  • Longer read lengths.
  • Easier sample preparation.
  • Lower input DNA requirements.
  • Higher throughput.

3.3 New DNA Sequencing Technologies
• Expected improvements from third-generation technologies:
  • Paired reads that include more than two reads from a single DNA fragment.
  • Long-range sequence information with low input DNA requirements.
• Sequencing technologies continue to develop quickly thanks to improvements in:
  • Chemistry.
  • Imaging.
  • Manufacturing.

3.3 New DNA Sequencing Technologies
• Further improvements are expected in:
  • Read lengths.
  • Insert lengths.
  • Throughput.
• A newer sequencing technology is nanopore sequencing, which directly reads the nucleotides of long DNA molecules and represents a dramatic advance.
• Nanopore sequencing generates extremely long reads (tens of kb).

3.3 New DNA Sequencing Technologies
New features: longer read lengths and higher throughput.

3.3 New DNA Sequencing Technologies
New features: easier sample preparation.

3.3 New DNA Sequencing Technologies
New features: lower input DNA requirements.

3.3 New DNA Sequencing Technologies
Development remains active thanks to improvements in chemistry, image processing, and data processing.

4. Resequencing Strategies for Structural Variation
• Purpose: predict SVs from alignments of sequence reads to the reference genome.
• Steps:
  • Alignment of reads.
  • Prediction of SVs from the alignments.
• Resequencing is straightforward in principle, but detecting SVs in human genomes is hard in practice.
• Some types of SVs are easy to detect; others are very difficult.

4. Resequencing Strategies for Structural Variation
Step 1: Alignment of reads.

4. Resequencing Strategies for Structural Variation
Step 2: Prediction of SVs from the alignments.

4. Resequencing Strategies for Structural Variation
• Some SVs are hard to detect because of technological limitations and biological features.
• Technological limitations:
  • Sequencing errors.
  • Limited read lengths.
  • Limited insert sizes.
• Biological features of SVs:
  • Breakpoints are enriched for repetitive sequences.
  • SVs may overlap, with multiple states or complex architectures.
  • Variants may recur at the same locus.

4. Resequencing Strategies for Structural Variation
• Therefore, alignment and prediction of SVs are not easy tasks.
• Effective algorithms are required for highly sensitive and specific prediction of SVs.
• Three approaches to identifying SVs from aligned reads:
  • Split reads.
  • Depth of coverage analysis.
  • Paired-end mapping.
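
Of the three approaches, paired-end mapping is the easiest to show in a few lines. The sketch below assumes each read pair has already been aligned, that a concordant pair maps to one chromosome with opposite strands, and that the library has a known mean insert size. The `ReadPair` type, the 300 bp mean, and the 100 bp tolerance are hypothetical values for illustration; the insertion signature is a standard interpretation that the slides do not list.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions and strands of the two reads from one DNA fragment."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair: ReadPair, mean_insert: int = 300, tolerance: int = 100) -> str:
    """Suggest the SV type that a single read pair is consistent with."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"      # the two ends map to different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion"          # unexpected relative orientation
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + tolerance:
        return "deletion"           # ends map farther apart than the fragment size
    if span < mean_insert - tolerance:
        return "insertion"          # ends map closer together than expected
    return "concordant"
```

A single discordant pair is weak evidence, since alignment errors and repeats produce the same signatures. Practical methods cluster many pairs that support the same breakpoints before predicting an SV.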

4.1 Read Alignment
• Read alignment is one of the most researched problems in bioinformatics.
• The specialized task of aligning millions to billions of individual short reads is performed by software such as:
  • Maq.
  • BWA.
  • Bowtie/Bowtie2.
  • BFAST.
  • mrsFAST.
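
The core idea behind short-read alignment can be sketched with a toy seed-and-verify aligner. This is not how the tools listed above are implemented; they rely on far more compact and elaborate indexes. The sketch indexes every k-mer of the reference, uses the first k bases of a read as a seed, and verifies each candidate position by counting mismatches. All function names and parameters are illustrative assumptions.

```python
from collections import defaultdict


def build_index(reference: str, k: int):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read: str, reference: str, index, k: int, max_mismatches: int = 1):
    """Return (position, mismatches) for each ungapped placement of the read
    whose first k bases match the reference exactly."""
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

With the reference `ACGTACGTTAGC` and k = 4, the read `ACGTTA` has two seed hits (positions 0 and 4), but only position 4 survives verification with at most one mismatch. Because the seed must match exactly, a read with an error in its first k bases is missed, one reason real aligners use multiple seeds.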
