NGS data analyses with BioUML

NGS data analyses with BioUML. Fedor Kolpakov Biosoft.Ru, Ltd. Institute of Systems Biology, Ltd. Novosibirsk, Russia. Agenda. BioUML overview NGS tools quality control alignment tools annotation tools workflows Genome browser Archakov’s genome Ribosome profiling Live demonstration.

Presentation Transcript


  1. NGS data analyses with BioUML. Fedor Kolpakov, Biosoft.Ru, Ltd.; Institute of Systems Biology, Ltd., Novosibirsk, Russia

  2. Agenda • BioUML overview • NGS tools • quality control • alignment tools • annotation tools • workflows • Genome browser • Archakov’s genome • Ribosome profiling • Live demonstration

  3. BioUML overview

  4. BioUML platform BioUML is an open source integrated platform for systems biology that covers a comprehensive range of capabilities, including access to databases with experimental data and tools for formalized description, visual modeling and analysis of complex biological systems. Thanks to scripting (R, JavaScript) and workflow support, it provides powerful capabilities for the analysis of high-throughput data. The plug-in based architecture (the Eclipse runtime from IBM is used) allows new functionality to be added as plug-ins. The BioUML platform consists of 3 parts: • BioUML server – provides access to biological databases • BioUML workbench – standalone application • BioUML web edition – web interface based on AJAX technology

  5. Main platforms for bioinformatics and BioUML • Taverna: standalone application, powerful workflows • R/Bioconductor: scripts, statistics, plots • BioClipse: Eclipse plug-in based architecture, chemoinformatics • Galaxy: workflows, web interface, collaborative research, genome browser • BioUML platform: scripts, statistics, plots; standalone application, powerful workflows; Eclipse plug-in based architecture, chemoinformatics; web interface, collaborative research, genome browser

  6. Main platforms for bioinformatics and BioUML • Taverna: standalone application, powerful workflows • R/Bioconductor: scripts, statistics, plots • BioClipse: Eclipse plug-in based architecture, chemoinformatics • Galaxy: workflows, web interface, collaborative research, genome browser • BioUML platform: scripts, statistics, plots; standalone application, powerful workflows; Eclipse plug-in based architecture, chemoinformatics; web interface, collaborative research, genome browser • + systems biology: visual modelling, simulation, parameters fitting, … • + chat for on-line consultations

  7. Market and platform • Android market – Android • AppStore – MacOS, iPod, iPhone • Biostore – BioUML

  8. BioUML ecosystem • Developers (provide tools and databases): plug-ins (methods, visualization, etc.), databases • Users (use): subscriptions, collaborative & reproducible research • Experts (provide services): services for data analysis, on-line consultations • Biostore • BioUML platform

  9. NGS • methods integrated into BioUML (Bowtie, MACS, ChIPHorde, ChIPMunk, …) • programs integrated into Galaxy • R packages • annotation of found peaks (SNPs, sites, etc.) • visualization • workflows • ChIP-SEQ • RNA-SEQ • human genome assembly and annotation (in progress) • support for parallelizing external programs as part of a workflow • GTRD database (based on ChIP-SEQ data) • dedicated servers • Amazon EC2 – on request • Biodatomics – 64 cores, 256 GB of memory.

  10. Galaxy – analyses methods

  11. Galaxy - workflow

  12. Raw data preprocessing • Preprocess raw reads: removes reads that fail simple quality tests, removes adapters, trims low-quality bases from read ends • Track statistics: gathers various statistics about a track or FASTQ file
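
The trimming and filtering step described on this slide can be illustrated with a minimal Python sketch. It is not BioUML's implementation: the Phred+33 quality encoding, the `min_q` threshold and the `min_len` cutoff are all assumptions chosen for illustration, and adapter removal is left out.

```python
def phred33(qual):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


def trim_read(seq, qual, min_q=20):
    """Trim bases with quality below min_q from both read ends."""
    scores = phred33(qual)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def preprocess(reads, min_q=20, min_len=4):
    """Trim every read, then drop reads that became too short.

    min_len is a hypothetical stand-in for a 'simple quality test'.
    """
    kept = []
    for seq, qual in reads:
        s, q = trim_read(seq, qual, min_q)
        if len(s) >= min_len:
            kept.append((s, q))
    return kept


reads = [("ACGTAC", "##IIII"), ("ACGT", "####")]
cleaned = preprocess(reads)
```

Here the first read loses its two low-quality leading bases and the second read is discarded entirely.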

  13. Short read alignment • Bowtie: fast, no indels, used for ChIP-seq • Novoalign: single-end and paired-end; nucleotide and color space; handles indels; finds global optimum alignments using the full Needleman-Wunsch algorithm
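
As a reminder of what a Needleman-Wunsch global alignment computes, here is a minimal Python sketch of the score calculation. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions with a linear gap penalty; they are not Novoalign's actual scoring scheme, and the sketch returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Uses the classic dynamic programming recurrence and keeps only
    one previous row of the score matrix in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


score = needleman_wunsch_score("ACGT", "AGT")  # one deletion: 3 matches, 1 gap
```

Because the recurrence allows gaps, a read with a single-base deletion still aligns end to end, which is the indel handling the slide contrasts with Bowtie.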

  14. RNA-seq with TopHat and Cuff* tools

  15. ChIP-seq • Bowtie for alignment • MACS for peak calling • ChIPMunk, IPS, MEME for motif discovery

  16. Popular NGS toolboxes available: GATK, Picard, SAMtools

  17. An example: workflow for analyses of ChIP-Seq data

  18. example: RNA-seq workflow

  19. NGS data quality control • 2 examples: • RNA-seq data (rat, IPS) • genome data – Archakov’s genome

  20. Track statistics (FastQC) • Estimates the quality of raw or aligned reads, as in the FastQC program http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • All original FastQC processors are supported • Works faster than FastQC • Additional processor: Overrepresented prefixes • Overrepresented K-mers works more precisely (does not skip 80% of sequences) • Along with the HTML report, separate statistics tables are generated and accessible for further analysis • Ability to merge several reports into a composite report • Like any BioUML analysis, it can become part of a workflow, script, etc. • Tested on Archakov AP3 (raw reads: 5.9 Gb csfasta + 12.7 Gb qual), analysis time: 36 min (all processors) • Tested on Zakian db50 (raw reads: 6.5 Gb fastq), analysis time: 7 min (all processors)
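
Two of the statistics named on this slide, quality per base and K-mer counts, can be sketched in a few lines of Python. This is an illustration only, not the BioUML or FastQC code: it assumes four-line FASTQ records with Phred+33 qualities, and the function names are hypothetical. In the spirit of the slide, every read is counted rather than a sample.

```python
from collections import Counter


def parse_fastq(lines):
    """Yield (sequence, quality) pairs from four-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def mean_quality_per_base(records):
    """Mean Phred score at each read position (Phred+33 assumed)."""
    totals, counts = [], []
    for _, qual in records:
        for pos, ch in enumerate(qual):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - 33
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def kmer_counts(records, k=5):
    """Count every k-mer in every read, with no subsampling."""
    counts = Counter()
    for seq, _ in records:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


fastq = ["@r1", "ACGTACGT", "+", "IIII####",
         "@r2", "ACGTA", "+", "IIIII"]
records = list(parse_fastq(fastq))
per_base = mean_quality_per_base(records)
top_kmer = kmer_counts(records).most_common(1)[0]
```

Returning plain lists and counters, rather than a rendered report, mirrors the slide's point that the statistics tables stay available for further analysis.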

  21. Track statistics launch • Input data: BAM, FASTQ and SOLiD (colorspace) data are supported • Option: whether reads should be aligned by their left or right side • Individual processors can be switched off to save time

  22. Track statistics results (Archakov AP3): Quality per base

  23. Track statistics results (Archakov AP3): Quality per sequence

  24. Track statistics results (Archakov AP3): Nucleotide content per base

  25. Track statistics results (Archakov AP3): GC content per base

  26. Track statistics results (Archakov AP3): GC content per sequence

  27. Track statistics results (Archakov AP3): N content per base

  28. Track statistics results (Archakov AP3): Duplicate sequences

  29. Track statistics results (Archakov AP3): Overrepresented sequences and 5-mers

  30. Track statistics results (Archakov AP3): Overrepresented prefixes

  31. Track statistics results (Zakian db50): Quality per base

  32. Track statistics results (Zakian db50): Quality per sequence

  33. Track statistics results (Zakian db50): Nucleotide content per base

  34. Track statistics results (Zakian db50): GC content per base

  35. Track statistics results (Zakian db50): GC content per sequence

  36. Track statistics results (Zakian db50): N content per base

  37. Track statistics results (Zakian db50): Duplicate sequences

  38. Track statistics results (Zakian db50): Overrepresented sequences and 5-mers

  39. Genome browser

  40. Genome browser: main features • uses AJAX and HTML5 <canvas> technologies • interactive - dragging, semantic zoom • tracks support • Ensembl • DAS-servers • user-loaded BED/GFF/Wiggle files

  41. DAS The Distributed Annotation System (DAS) defines a communication protocol used to exchange annotations on genomic or protein sequences. It is motivated by the idea that such annotations should not be provided by single centralized databases, but should instead be spread over multiple sites. Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients. DAS is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up sequence annotation information from multiple distant web sites, collate the information, and display it to the user in a single view. DAS is heavily used in the genome bioinformatics community. Over the last years we have also seen growing acceptance in the protein sequence and structure communities.
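
The client side of this model, one client collating annotations from several servers, might be sketched as follows. The URL shape follows the DAS 1.x `features` command, but the server addresses, the data source name and the hand-written, heavily simplified XML responses are all assumptions for illustration; no network request is made.

```python
import xml.etree.ElementTree as ET


def features_url(server, source, ref, start, stop):
    """Build a DAS 1.x style 'features' request URL for one segment."""
    return f"{server.rstrip('/')}/das/{source}/features?segment={ref}:{start},{stop}"


def parse_features(xml_text, origin):
    """Extract features from a (simplified) DASGFF response."""
    root = ET.fromstring(xml_text)
    return [
        {
            "server": origin,
            "id": feat.get("id"),
            "type": feat.findtext("TYPE"),
            "start": int(feat.findtext("START")),
            "end": int(feat.findtext("END")),
        }
        for feat in root.iter("FEATURE")
    ]


def collate(responses):
    """Merge features from several servers into one sorted view."""
    merged = []
    for origin, xml_text in responses.items():
        merged.extend(parse_features(xml_text, origin))
    return sorted(merged, key=lambda f: (f["start"], f["end"]))


# Hypothetical responses from two hypothetical servers.
TEMPLATE = (
    '<DASGFF><GFF><SEGMENT id="1" start="1000" stop="2000">'
    '<FEATURE id="{fid}"><TYPE>{ftype}</TYPE>'
    "<START>{start}</START><END>{end}</END></FEATURE>"
    "</SEGMENT></GFF></DASGFF>"
)
responses = {
    "http://server-a.example": TEMPLATE.format(fid="g1", ftype="gene", start=1500, end=1900),
    "http://server-b.example": TEMPLATE.format(fid="s1", ftype="SNP", start=1200, end=1200),
}
url = features_url("http://server-a.example/", "hg19", "1", 1000, 2000)
view = collate(responses)
```

The collated list is what a genome browser track would draw: features from different sites, ordered by position in a single view.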

  42. Genome browser Two BAM tracks are compared with each other (example view on Human NCBI37, chr. 1). A profile showing the coverage is visible.
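
A coverage profile of the kind shown for each BAM track is simply the number of reads overlapping each position. The Python sketch below computes it from read intervals; the half-open `(start, end)` coordinates and the function name are assumptions, and reading real BAM files is outside its scope.

```python
def coverage_profile(reads, region_start, region_end):
    """Per-position read depth over [region_start, region_end).

    reads is an iterable of half-open (start, end) intervals.
    A difference array keeps the cost linear in reads + region length.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)
    for start, end in reads:
        s = max(start, region_start) - region_start
        e = min(end, region_end) - region_start
        if s < e:  # the read overlaps the region
            diff[s] += 1
            diff[e] -= 1
    profile, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        profile.append(depth)
    return profile


track_a = coverage_profile([(0, 4), (2, 6), (5, 10)], 0, 8)
track_b = coverage_profile([(1, 3)], 0, 8)
difference = [a - b for a, b in zip(track_a, track_b)]
```

Computing both profiles over the same region makes the two tracks directly comparable position by position.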

  43. Genome browser Upon zooming, individual reads become visible. All information associated with the selected read is displayed in the Info box.

  44. Genome browser At detailed scale, a graph of Phred qualities is displayed, along with the nucleotides that differ between the read and the reference sequence.
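
The two pieces of information shown at this zoom level can be derived from a read as in the Python sketch below. It assumes Phred+33 encoded qualities and an ungapped alignment in which the read and the reference window have equal length; the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to a base-call error probability."""
    return 10 ** (-q / 10)


def read_details(read_seq, read_qual, ref_seq):
    """Return per-base Phred scores and (position, ref, read) mismatches.

    Assumes an ungapped alignment: read_seq and ref_seq are the same
    length and aligned position by position.
    """
    quals = [ord(c) - 33 for c in read_qual]
    mismatches = [
        (i, ref_base, read_base)
        for i, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq))
        if read_base != ref_base
    ]
    return quals, mismatches


quals, mismatches = read_details("ACGT", "II#I", "ACCT")
```

In this example the only mismatch sits at the one low-quality position, which is the kind of pattern the combined display makes easy to spot.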

  45. NGS data: Archakov’s genome
