
Applied Bioinformatics


xena

Presentation Transcript


  1. Applied Bioinformatics Week 5

  2. Topics • Cleaning of Nucleotide Sequences • Assembly of Nucleotide Reads • Gene Prediction • Prokaryotic • Eukaryotic

  3. Theoretical Part I • DNA sequencing • Next generation sequencing • Cleaning nucleotide sequences

  4. DNA Sequencing • Sanger Method • Please explain • Other methods • Too many to discuss • http://en.wikipedia.org/wiki/DNA_sequencing

  5. Shotgun Sequencing • Many short (~700 nt) sequences • Human genome sequencing project • Finished? • How can you make sense of these sequences? • Contrast: genome walking

  6. Next Generation Sequencing • Increases the throughput of sequencing • More sequence per time • Not more sequence per read (still around 500) • Many commercial platforms available • 454 pyrosequencing • Illumina (Solexa) sequencing • ... • Price is dropping • Whole genomes in a day • http://www.1000genomes.org/

  7. 454 Pyrosequencing http://genepool.bio.ed.ac.uk/

  8. Illumina sequencing http://seqanswers.com/forums/showthread.php?t=21

  9. Where is your DNA from? • Did you just clone and sequence? • Did you send a sample to a company? • Did you find the sequence in a database? • Better make sure it is correct and clean

  10. Vector Contaminations (image: Wikipedia) • Long DNA pieces are fragmented and cloned into vectors before sequencing. • This usually causes some amount of vector to be sequenced along with the insert.

  11. Adapter Contaminations Long DNA pieces are fragmented and adapter sequences are ligated to both ends of the fragments before sequencing. This causes adapters to be sequenced along with the desired sequence.

  12. Contaminations Cause Misassembly • Contaminations that are not removed from genomic sequences cause the sequences to be misassembled.

  13. Cleaning Contaminations • Several approaches and tools to clean vector contaminations from genomic sequences have been developed. • Most of them rely on a reference vector library, including: • LUCY, LUCY2 • SeqTrim • DeconSeq • TagCleaner • cross_match • SeqClean • VecScreen

  14. Problem Definition • A vector is a circular DNA sequence. • Once a vector has been linearized for a reference library, contaminations spanning the linearization point can no longer be detected and cleaned by currently available tools.

  15. UniVec • A vector library by NCBI • Problems: • Has complete sequences for only 8 vectors, although full-length sequences for the rest are available in public databases as well. • Only these 8 vectors are appended to themselves by 49 nt to overcome the circularization problem. • Some vectors are divided into partitions for no apparent reason. • Some adapter sequences are appended to themselves as well, whereas others are not.

  16. Previous Solution • Y.-A. Chen, C.-C. Lin, C.-D. Wang, H.-B. Wu, and P.-I. Hwang, “An optimized procedure greatly improves EST vector contamination removal,” 2007. • Not designed for entire libraries • Proposes cutting the first 60 nucleotides from the start of a vector sequence and pasting them to the end using a simple text editor • An implementation is no longer available

  17. Our Solution • Appending all vector sequences in a reference library (or a user-filtered subset) to themselves, either in full or by their first n nucleotides (n chosen by the user) • As customizable as possible, but still efficient with a single click • Has a GUI for target users

  18. Our Solution • Possible Customisations • Cleaning appendices already present in the library • Filtering the sequences by a keyword in their definition lines and/or by length • Virtual Circularization • Appending the first n nucleotides of each sequence to its own end
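
The virtual circularization idea from slides 14 to 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: the function name `circularize`, its parameters, and the 30-nt toy vector are all hypothetical, and plain substring search stands in for a real similarity search.

```python
def circularize(records, n=None, keyword=None, min_length=0):
    """Append each vector's first n nucleotides to its own end.

    records: list of (definition_line, sequence) tuples.
    n=None appends the complete sequence to itself.
    keyword / min_length: optional filters on definition line and length.
    """
    result = []
    for definition, seq in records:
        if keyword is not None and keyword.lower() not in definition.lower():
            continue
        if len(seq) < min_length:
            continue
        appendix = seq if n is None else seq[:n]
        result.append((definition, seq + appendix))
    return result


# Hypothetical 30-nt "vector", linearized at an arbitrary point.
vector = "ATGCGTACCTGAAGTCTAGGCATTCAGGTC"
# A contaminating fragment that spans the linearization point.
junction = vector[-8:] + vector[:8]

library = [("toy_vector cloning vector", vector)]
circular_library = circularize(library, n=10)

print(junction in vector)                  # searched against the linear entry
print(junction in circular_library[0][1])  # searched against the appended entry
```

The fragment that crosses the linearization point is invisible in the linear entry and becomes detectable once the first 10 nt are appended, which is the effect the slides describe.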

  19. Efficiency of Our Method • Datasets: • Every 600th EST • P. somniferum EST • Artificial Data • Vector Libraries • rawUV • cleanUV • appUV

  20. Theoretical Part I • Mind Mapping • Break 10 min

  21. Practical Part I

  22. Screening for Vector seqs • www.ncbi.nlm.nih.gov/VecScreen • Get the U87251 sequence (FASTA) • What is this number? • Enter the sequence and run the analysis • What do you see as a result? • Would you continue with the experiment? • Would you discard the sequence?

  23. Sequencing • Since we cannot do any sequencing here, we have to prepare a simulation • Select a nucleotide sequence of about 15000 bases • Copy and paste that sequence into Word • 3 times • Separated by empty lines

  24. Sequencing • Arbitrarily add linebreaks into the resulting document • At least 30 (10 per copy min) • Spread out throughout the sequence • Add a FASTA definition line after each line break • Use >Copy-N-Fragment-X as a template for the definition line • Ensure that the overall number of characters is less than 50000
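
For readers who prefer a script to a word processor, the manual simulation on slides 23 and 24 could be approximated as below. It is a sketch under assumptions: the helper names `fragment_copies` and `to_fasta` are made up for this example, the template is a random 15000-nt string rather than a real sequence, and break points are drawn uniformly at random.

```python
import random


def fragment_copies(sequence, copies=3, breaks_per_copy=10, seed=0):
    """Cut each copy of the sequence at random positions into fragments."""
    rng = random.Random(seed)
    records = []
    for copy_no in range(1, copies + 1):
        # Distinct cut positions strictly inside the sequence.
        cuts = sorted(rng.sample(range(1, len(sequence)), breaks_per_copy))
        bounds = [0] + cuts + [len(sequence)]
        for frag_no, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
            name = f"Copy-{copy_no}-Fragment-{frag_no}"
            records.append((name, sequence[start:end]))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Random stand-in for the ~15000-base sequence chosen in the exercise.
template = "".join(random.Random(1).choices("ACGT", k=15000))
reads = fragment_copies(template)
fasta = to_fasta(reads)
print(len(reads), len(fasta))
```

Ten breaks per copy give eleven fragments per copy, and the resulting FASTA text stays below the 50000-character limit mentioned on slide 24.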

  25. Practical Part I • 15 min break

  26. Theoretical Part II • Sequence Assembly • Gene Prediction

  27. Assembling Sequences • Shotgun sequencing • Sequence fragments • Find overlapping fragments • Build contiguous sequences (contig) • Assemble into whole genomes • Genetic and physical maps • Help orient fragments and contigs • Problems with repetitive sequences

  28. Sequence Tagged Sites • Physical map • Up to 200 bp long • Unique for a region of the genome • STS reference map • Map to assemble BAC/PAC clones • Repeat process to map contigs to clones

  29. Sequence Tagged Site (figure labels: endonuclease site, sequence tagged site, chromosome) • The restriction enzyme should digest the DNA into fragments approximately 200 kb long

  30. Fragments with STS • Up to 700 kb! • If it fits into a plasmid (up to 10 kb) • Shortest chromosome (21): 47 Mb -> 250 BACs

  31. 1 BAC -> 10 – 50 Plasmids / Cosmids Plasmid / Cosmid

  32. Use several nucleases • EcoRI • BamHI • HindIII • Target ~1000 nucleotides • Polymerase Chain Reaction will lead predominantly to: • Primer

  33. Restriction • Sequence with degenerate primers? Or subclone and sequence • Sequencing: • Clone01: ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT • Clone02: TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG • ... • Clone20: GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT • ...

  34. >Clone01 ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT >Clone02 TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG ... >Clone20 GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT ... • Compare all vs. all with Smith-Waterman or a more specialized algorithm • The last 12 nt of Clone20 (ACCGACTACGAT) match the first 12 nt of Clone01 • Check the other ends of both clones for overlaps as well
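
The overlap between Clone20 and Clone01 can be found programmatically. The sketch below uses exact suffix-prefix matching, which is a deliberate simplification and not the Smith-Waterman alignment named on the slide; real reads contain errors and need an alignment that tolerates mismatches. The function names and the `min_len` threshold are assumptions made for this example.

```python
def suffix_prefix_overlap(a, b, min_len=8):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b, overlap):
    """Join a and b, writing the shared overlap only once."""
    return a + b[overlap:]


clone01 = "ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT"
clone20 = "GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT"

k = suffix_prefix_overlap(clone20, clone01)
contig = merge(clone20, clone01, k)
print(k)
print(contig)
```

Running every clone against every other clone in both orders is what makes assembly expensive: the number of comparisons grows with the square of the number of reads.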

  35. [Figure: Clone20 and Clone01 aligned at their overlap and merged into a longer sequence along the chromosome; not proportional] • For each plasmid the BAC and therefore the position on the chromosome is known • Sequencing all plasmids will give the complete sequence of the genome • Caution: highly simplified. Why? • What does coverage mean?
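
As a starting point for the coverage question: coverage is commonly defined as the average number of reads covering each base, c = N · L / G, with N reads of mean length L over a genome of size G. The sketch below also uses the idealized Lander-Waterman model, in which reads land uniformly at random and the expected fraction of uncovered bases is e^(-c). The function names are hypothetical, and the numbers are taken from the classroom simulation (33 fragments from 3 copies of a 15000-base sequence), not from a real project.

```python
import math


def coverage(num_reads, mean_read_length, genome_size):
    """Average number of reads covering each base: c = N * L / G."""
    return num_reads * mean_read_length / genome_size


def expected_uncovered_fraction(c):
    """Fraction of bases expected to be missed if reads land at random."""
    return math.exp(-c)


# 33 fragments with a total length of 45000 nt over a 15000-nt sequence.
c = coverage(num_reads=33, mean_read_length=45000 / 33, genome_size=15000)
print(c, expected_uncovered_fraction(c))
```

In the classroom simulation every copy is cut completely, so nothing is actually missed; the random model matters for real shotgun data, where coverage well above 1 is needed to close gaps.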

  36. Assembling Software • As you just saw, assembling sequences is computationally expensive • Therefore most assembly software is not offered as an online service, but it can often be downloaded for free

  37. What is Computational Gene Finding? Given an uncharacterized DNA sequence, find out: • Which region codes for a protein? • Which DNA strand is used to encode the gene? • Which reading frame is used in that strand? • Where does the gene start and end? • Where are the exon-intron boundaries in eukaryotes? • (optionally) Where are the regulatory sequences for that gene? Computational Gene Finding

  38. Prokaryotic vs. Eukaryotic Gene Finding • Prokaryotes: • small genomes (0.5–10 × 10^6 bp) • high coding density (>90%) • no introns • Gene identification relatively easy, with success rate ~99% • Problems: overlapping ORFs, short genes, finding TSS and promoters • Eukaryotes: • large genomes (10^7–10^10 bp) • low coding density (<50%) • intron/exon structure • Gene identification a complex problem, gene-level accuracy ~50% • Problems: many
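
Because prokaryotic genes have no introns, a first approximation to the questions on slide 37 (which strand, which reading frame, where the gene starts and ends) is a six-frame scan for open reading frames. The sketch below is a naive illustration and not how production gene finders work: it only accepts ATG as a start codon, ignores overlapping ORFs and promoters, and reports coordinates on the scanned strand. All names and the toy sequence are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List (strand, frame, start, end) for ATG...stop ORFs in all six frames.

    start/end are 0-based, end-exclusive positions on the scanned strand;
    the codon count used for min_codons includes the start and stop codons.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


toy = "CCATGAAATTTTAGCC"
print(find_orfs(toy))
```

On the toy sequence the only complete ORF is ATG AAA TTT TAG on the forward strand in frame 2; the reverse strand contains an ATG without a following in-frame stop, so it is not reported.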

  39. Gene Structure

  40. Gene Finding: Different Approaches • Similarity-based methods (extrinsic) - use similarity to annotated sequences: • proteins • cDNAs • ESTs • Comparative genomics - Aligning genomic sequences from different species • Ab initio gene-finding (intrinsic) • Integrated approaches

  41. Similarity-based methods • Based on sequence conservation due to functional constraints • Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases • Will not identify genes that code for proteins not already in databases (can identify ~50% new genes) • Limits of the regions of similarity not well defined

  42. Comparative Genomics • Based on the assumption that coding sequences are more conserved than non-coding sequences • Two approaches: • intra-genomic (gene families) • inter-genomic (cross-species) • Alignment of homologous regions • Difficult to define limits of higher similarity • Difficult to find optimal evolutionary distance (patterns of conservation differ between loci)

  43. Computational Gene Finding

  44. Summary for Extrinsic Approaches • Strengths: • Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions • Weaknesses: • Limited to pre-existing biological data • Errors in databases • Difficult to find limits of similarity

  45. Ab initio Gene Finding, Part 1 • Input: A DNA string over the alphabet {A,C,G,T} • Output: An annotation of the string showing for every nucleotide whether it is coding or non-coding • Example input: AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG • Gene finder output: AAAGC ATGCAT TTA ACG A GT GCATC AG GA CTC CAT ACGTAA TGCCG

  46. Ab initio Gene Finding, Part 2 • Using only sequence information • Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored) • Integrates coding statistics with signal detection

  47. Coding Statistics, Part 1 • Unequal usage of codons in coding regions is a universal feature of genomes • uneven usage of amino acids in existing proteins • uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs) • We can use this feature to differentiate between coding and non-coding regions of the genome • Coding statistic - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein
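
One simple coding statistic of this kind is a codon-usage log-likelihood ratio: the codon frequencies of known coding sequences are compared with a uniform background of 1/64 per codon. The sketch below is only an illustration. The training sequence is a made-up toy string, not real genomic data, and the function names and the pseudocount are assumptions.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into consecutive codons, starting in frame 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_frequencies(training_seqs, pseudocount=1.0):
    """Estimate codon frequencies from coding sequences, smoothed by a pseudocount."""
    counts = Counter()
    for seq in training_seqs:
        counts.update(codons(seq))
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, freqs):
    """Sum over codons of log(f_coding(codon) / (1/64)); higher means more coding-like."""
    return sum(math.log(freqs[c] * 64) for c in codons(seq) if c in freqs)


# Made-up "known coding" sequence, used only to fill the frequency table.
training = ["ATGGCTGCTGAAGCTGCTAAAGCTTAA"]
freqs = codon_frequencies(training)

print(coding_score("GCTGCTGCT", freqs))  # codons frequent in the training set
print(coding_score("CCCCCCCCC", freqs))  # codon absent from the training set
```

A sequence built from codons that are frequent in the training set scores above zero, and one built from codons the training set never uses scores below zero. With real data the table would be estimated from many annotated genes of the organism in question.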

  48. Coding Statistics, Part 2 • Many different ones • codon usage • hexamer usage • GC content • compositional bias between codon positions • nucleotide periodicity • …

  49. An Example of Coding Statistics, Part 1

  50. Computing Coding Statistics in Practice • Usually, the value of a coding statistic is computed in sliding windows, which yields a coding profile of the sequence • Larger windows (50 – 200 bp) are required to detect a clear signal
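
A sliding-window profile can be sketched as follows. GC content is used as the statistic only because it is the simplest item on the list in slide 48; any other coding statistic could be passed in instead. The function names, the default window and step, and the toy sequence are assumptions made for this example.

```python
def gc_fraction(window):
    """Fraction of G and C bases in a window."""
    if not window:
        return 0.0
    return sum(base in "GC" for base in window) / len(window)


def sliding_profile(seq, window=100, step=10, statistic=gc_fraction):
    """Return (start, value) pairs of the statistic for each full window."""
    return [
        (start, statistic(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-only half followed by a GC-only half.
toy = "AT" * 50 + "GC" * 50
profile = sliding_profile(toy, window=50, step=50)
print(profile)
```

The window size is a trade-off: short windows give a noisy profile, long windows blur the boundaries between coding and non-coding regions.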
