
The Role of Bioinformatics in Cancer Biotechnology

The Role of Bioinformatics in Cancer Biotechnology. Bob Stephens Advanced Biomedical Computing Center Information Systems Program Feb 24, 2012.


Presentation Transcript


  1. The Role of Bioinformatics in Cancer Biotechnology. Bob Stephens, Advanced Biomedical Computing Center, Information Systems Program. Feb 24, 2012

  2. Outline
     ● Origin of bioinformatics
     ● Why the expanding importance?
     ● Nextgen sequencing (big data)
     ● Integrative and systems biology (complex data)
     ● How to pursue an interest in bioinformatics
     ● Discussion

  3. What is bioinformatics?
     ● Bioinformatics is the application of computational methods to the analysis of any type of biological data.
     ● Bioinformatics has become a diverse, multi-disciplinary field that originally derived from the computer and biological sciences.
     ● It now has sub-disciplines such as medical informatics, systems biology and clinical informatics.
     ● As a result, PhD and masters level programs have emerged dedicated to different aspects of bioinformatics.

  4. Evolution of bioinformatics
     ● 1980s: began as methods to scan protein and nucleic acid sequence databases for similarity (both available in print form).
     ● Rapid technological advances across multiple biological domains set the pace for data acquisition: microarrays, nextgen sequencing, imaging, proteomics.
     ● Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments.
     ● Co-evolution with web browser and programming language technologies (now cloud).

  5. What is the ABCC?
     ● The ABCC is part of the Information Systems Program (ISP), a division of SAIC-F.
     ● The ISP is interconnected with the Advanced Technology Program and supports its computational needs.
     ● The ISP has computational infrastructure, system administration, networking, security, bioinformatics support and program development all under one roof.
     ● One of several NIH and NCI intramural bioinformatics resources.

  6. Layered Infrastructure Support
     ● The CCR-IFX core consists of analyzers and users.
     ● The BSG uses databases and applications, tools, utilities and resources.
     ● The SCPD uses algorithms, development and optimization.
     ● The ISP computes, stores and networks.

  7. Bioinformatics infrastructures
     ● Command-line implementations (open source).
     ● Primitive GUI implementations.
     ● Sophisticated GUI interfaces and application packaging.
     ● Web interfaces and the Java language give platform-independent access.
     ● PC-based, web-based and server-based architectures.
     ● Multiple-tier infrastructures distribute the computational burden.
     ● Cloud-based: limited by data volumes.

  8. How can bioinformatics facilitate cancer research?
     • Diagnosis: identify classifiers to better sub-divide cancer etiologies into groups. Better individual data to match treatment to the individual.
     • Treatment: identify better methods to track treatment progress and indicate problems earlier.
     • Prevention: understand mechanisms for cancer initiation, progression and development, and identify targets in this process.
     • Connect data from geographically distributed cancer patients for more complete analysis, especially for rare cancers.

  9. NCI NGS ENVIRONMENT

  10. NCI-F NGS Instrument Landscape The SF-ATC, POB-STC, PCC/CADC, LMT, CGF-ATC and NCI labs interact.

  11. NCI-F NGS Geographical Landscape ATRF LMT NCI-F ATC NIH

  12. “Next-Generation” Sequencing Technologies

  13. Clinical Samples K

  14. NGS Analysis Steps
     ● Primary analysis includes base calling and QC/QA filtering.
     ● Secondary analysis includes mapping, coverage analysis, expression analysis, variant identification and impact assessment.
     ● Tertiary analysis includes comparison tools and interpretation.

  15. Typical Variant Identification Pipeline
     ● A raw read after QC filtering yields an input read, which may be mapped, unmapped, ambiguous or not mapped.
     ● The mapped reads can be split, and read depths yield concordant and discordant mates.
     ● The split reads yield SV, CNV, cell variants, class enrichment, impact analysis and disease association.
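
The read-fate step in this pipeline can be illustrated with a minimal sketch. It assumes each read carries a list of alignment scores, one per candidate location; the `Read` class, the score values and the tie rule for "ambiguous" are all hypothetical and are not taken from any particular mapper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    # Hypothetical alignment scores, one per candidate location (higher is better).
    hit_scores: list = field(default_factory=list)


def classify(read: Read) -> str:
    """Assign a read to one of three fates based on its candidate alignments."""
    if not read.hit_scores:
        return "unmapped"            # no candidate location at all
    best = max(read.hit_scores)
    if read.hit_scores.count(best) > 1:
        return "ambiguous"           # two or more equally good locations
    return "mapped"                  # a single best location


reads = [Read("r1", [60]), Read("r2", [40, 40]), Read("r3"), Read("r4", [55, 30])]
fates = Counter(classify(r) for r in reads)
print(dict(fates))
```

Real mappers express this distinction through mapping qualities and flags rather than an explicit tie check, so the sketch only conveys the idea of sorting reads by fate before variant calling.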

  16. Mapping Considerations
     ● Many different mapping applications are available.
     ● Each has a complex set of mapping parameters.
     ● Mapping is only as good as the reference it maps against.
     ● More mapping is not necessarily better: a mapper finds the best available site(s).
     ● On the same platform, different mappers will yield different mapping percentages and influence variant calls.
     ● Mappers will need to continue to evolve to allow multiple references to be searched.

  17. Potential Mapping Issues
     ● The reference genome is incomplete: Ns and many breaks per chromosome.
     ● The reference genome contains many repeats that are very large and very similar (far longer than current read lengths).
     ● The reference genome contains many regions known to vary by copy number or be subject to structural variation.
     ● The reference genome contains an ancestry bias.

  18. What does our reference look like?
     ● Ns: 7.5% (234 Mbp)
     ● Repeats: 47% (1.45 Gbp)
     ● Non-N, non-repeat: 45% (1.45 Gbp)
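
A breakdown of this kind can be approximated from a soft-masked reference, where repeats are written in lower case and gaps as N. The sketch below assumes that convention; the slide does not say how its figures were derived, and the function name and toy sequence are illustrative only.

```python
def composition(seq: str) -> dict:
    """Return the fraction of N, repeat (lower-case) and remaining bases."""
    if not seq:
        raise ValueError("empty sequence")
    total = len(seq)
    n_bases = sum(1 for base in seq if base in "Nn")
    # Soft-masked repeats are lower-case; gap characters are counted separately.
    repeat_bases = sum(1 for base in seq if base.islower() and base != "n")
    other = total - n_bases - repeat_bases
    return {
        "N": n_bases / total,
        "repeat": repeat_bases / total,
        "non_N_non_repeat": other / total,
    }


toy = "ACGTNNNNacgtacgtACGT"   # toy sequence, not real reference data
fractions = composition(toy)
print(fractions)
```

On a full genome the same counts would be accumulated chromosome by chromosome while streaming the FASTA file rather than holding one string in memory.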

  19. CNV Coverage By Percent

  20. Read fate mapping
     ● What are the unmapped reads? Most map to alternate assemblies; some do not map.
     ● When alternative alleles are considered, some 2/3 of reads that should have mapped to them were mapped elsewhere!
     ● Although only a small fraction of reads do not map, we cannot easily estimate the number that did not map correctly.

  21. 1k genomes (NA18508)
     ● Mapped to chromosomes and MT: 430 M reads (96%)
     ● Unmapped reads: 13.1 M reads (3%)
     ● Mapped to Chr_Un: 6 M reads (1%)
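
The percentages on this slide follow from the read counts. The short calculation below reproduces them from the three counts given (in millions of reads); the variable names are illustrative.

```python
# Read counts in millions, as given on the slide.
read_counts_millions = {
    "Mapped to chromosomes and MT": 430.0,
    "Unmapped reads": 13.1,
    "Mapped to Chr_Un": 6.0,
}

total = sum(read_counts_millions.values())
shares = {fate: 100.0 * count / total for fate, count in read_counts_millions.items()}

for fate, share in shares.items():
    print(f"{fate}: {share:.1f}% of {total:.1f} M reads")
```

Rounded to whole percentages, the three shares match the 96%, 3% and 1% shown above.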

  22. Worst case scenario - PSPHL

  23. PSPHL Complexities
     ● The gene is located on chromosome 7 with a 55 kb insert location.
     ● There is also:
        an indel locus, 427 bp with 99.6% identity
        an ancestral locus, 106 kb with 95% identity
        an additional locus, 465 bp with 95% identity

  24. The PSPHL gene structure and the deletion breakpoint were determined by sequence mining. However, there was a huge gap to fill (hg18).

  25. Variant Callers
     ● Many variant callers exist; some are components of larger applications, some are stand-alone.
     ● Like mappers, they have many variables and filtering steps, all of which alter the false discovery/false negative rates.
     ● SNVs are fairly well worked out; indels are more difficult to identify and have lower validation rates.
     ● The emerging consensus is to modularize the workflow: best-in-breed mapper followed by best-in-breed variant caller, which is more dynamic.
     ● Need a "truth" set to validate (Venter?).
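
Validation against a "truth" set amounts to comparing two sets of variants. The sketch below assumes variants are keyed by (chromosome, position, reference allele, alternate allele) and uses made-up calls; it shows how false discovery and false negative rates could be computed, not how any particular caller was evaluated.

```python
def error_rates(called: set, truth: set) -> dict:
    """Compare called variants with a truth set of the same sample."""
    true_pos = len(called & truth)
    false_pos = len(called - truth)    # called but not in the truth set
    false_neg = len(truth - called)    # in the truth set but missed
    return {
        "true_positives": true_pos,
        "false_discovery_rate": false_pos / len(called) if called else 0.0,
        "false_negative_rate": false_neg / len(truth) if truth else 0.0,
    }


# Made-up variants for illustration: (chrom, pos, ref, alt).
truth_set = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 40, "G", "A"), ("chr2", 90, "T", "C")}
called_set = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 40, "G", "A"), ("chr3", 15, "A", "T")}

rates = error_rates(called_set, truth_set)
print(rates)
```

Changing a filter threshold changes `called_set`, which is why each filtering step moves both rates at once.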

  26. Overlap amongst SNVs called by 3 popular variant callers (likely SAMTools, GATK and CLCBio)
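
An overlap figure of this kind can be tabulated with set operations. The sketch below uses three hypothetical call sets (`caller_a`, `caller_b`, `caller_c`) with made-up positions; it counts how many variants fall in each region of a three-way Venn diagram.

```python
def venn_regions(calls: dict) -> dict:
    """Count variants by the exact combination of callers that reported them."""
    names = sorted(calls)
    regions = {}
    for variant in set().union(*calls.values()):
        reported_by = tuple(name for name in names if variant in calls[name])
        regions[reported_by] = regions.get(reported_by, 0) + 1
    return regions


# Made-up SNV calls keyed by (chrom, pos, alt); caller names are placeholders.
calls = {
    "caller_a": {("chr1", 1, "G"), ("chr1", 2, "T"), ("chr1", 3, "A"), ("chr1", 4, "C")},
    "caller_b": {("chr1", 2, "T"), ("chr1", 3, "A"), ("chr1", 4, "C"), ("chr1", 5, "G")},
    "caller_c": {("chr1", 3, "A"), ("chr1", 4, "C"), ("chr1", 6, "T")},
}

regions = venn_regions(calls)
for combo, count in sorted(regions.items()):
    print(" & ".join(combo), count)
```

The count under all three names is the consensus set; variants reported by a single caller are the usual candidates for extra validation.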

  27. Double identity

  28. Systems biology overview

  29. Goals
     ● Background details
     ● Mechanisms of connectivity
     ● Level 1 integration
     ● More sophisticated integration
     ● Complex interpretation needs

  30. Supporting infrastructure
     ● Deep and complete genomic annotation for species of interest
     ● System to connect different data id types
     ● Ontology/controlled vocabulary for harmonization (apples==apples) [common data elements]
     ● Visualization capabilities for networks, heatmaps and genomic context

  31. Pathway Gene Set Analysis
     ● Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches etc.
     ● Clustering genes based on expression etc. provides only the first dimension.
     ● View prospective pathways impacted by changes in expression, protein levels, phosphorylation etc.
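
One common way to ask whether a gene set hits a pathway more often than chance is a hypergeometric over-representation test. The slides do not say which statistic is used in-house, so the sketch below is a generic illustration with made-up numbers, not the WPS or SLEPR method.

```python
from math import comb


def enrichment_p_value(universe: int, pathway: int, selected: int, overlap: int) -> float:
    """P(at least `overlap` pathway genes among `selected` genes drawn at random).

    universe: number of annotated genes considered
    pathway:  number of those genes that belong to the pathway
    selected: size of the experimental gene set
    overlap:  number of selected genes that are in the pathway
    """
    upper = min(pathway, selected)
    # Sum exact integer counts of favourable draws, then divide once.
    favourable = sum(
        comb(pathway, i) * comb(universe - pathway, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)


# Made-up example: 20 genes in total, 5 in the pathway, 4 selected, 2 overlapping.
p_value = enrichment_p_value(universe=20, pathway=5, selected=4, overlap=2)
print(f"p = {p_value:.4f}")
```

In practice the test is repeated for every pathway and the resulting p-values are corrected for multiple testing.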

  32. In-House NextGen WPS:Pathway-based Platform from Array/Proteomics to NGS Tertiary Analysis

  33. Visualization
     ● Pathway/network: Cytoscape and WPS
     ● Genome: many viewers available
     ● Heatmap: simple R tools available

  34. 2 principal integration tiers
     ● Gene level: measurements associated/collected at or below gene level, such as expression, proteomics, phosphorylation, binding, metabolomic etc.
     ● Genome coordinate level: ChIP-seq, cytogenetics, GWAS, arrayCGH etc.
     ● Both are incompletely annotated and complex.
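
Moving from the genome coordinate tier to the gene tier usually means asking which annotated genes a coordinate-based feature overlaps. The sketch below assumes half-open intervals and uses invented gene names and coordinates; it is an illustration of the idea, not a description of the in-house tools.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # inclusive
    end: int     # exclusive (half-open interval)
    label: str


def overlapping_genes(feature: Interval, genes: list) -> list:
    """Return labels of genes on the same chromosome that overlap the feature."""
    return [
        gene.label
        for gene in genes
        if gene.chrom == feature.chrom
        and gene.start < feature.end
        and feature.start < gene.end
    ]


# Invented annotation for illustration only.
genes = [
    Interval("chr1", 100, 200, "geneA"),
    Interval("chr1", 180, 300, "geneB"),
    Interval("chr2", 100, 200, "geneC"),
]

peak = Interval("chr1", 190, 210, "peak1")
print(peak.label, overlapping_genes(peak, genes))
```

A linear scan is enough for a sketch; genome-scale annotation sets are normally indexed (for example with interval trees or sorted bins) so that each lookup does not touch every gene.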

  35. Integration
     ● Goal: use the database and application infrastructure to create new integrated applications.
     ● Q: Tell me everything and anything you know about my gene.
     ● Where we are:
        biodbnet.abcc: data level integration of all 32 databases; seamless updates at the backend; ortholog, batch and custom conversions
        medXminer: complete Medline in Oracle XMLDB
     ● Near future: expand bioDBnet; integrate semantics layer into literature; development of new applications

  36. Integration Example: bioDBnet
     ● 193 biological identifiers from 32 biological databases
     ● bioDBnet integrates proteomics, genomics, transcriptomics and metabolomics to yield functional annotation, gene, drug, taxon, disease, interaction, protein, microarray, protein features, pathway and variation/polymorphism.

  37. bioDBnet: Biological DataBase Network
     ● Integrates 32 widely used biological databases
     ● The network has 193 nodes and 658 edges
     ● Handles batch conversions across databases (db2db), orthology conversions (dbWalk), organism wide conversions (dbOrg) and generates detailed annotation reports (dbReport)
     ● Major advantage: the database update procedure is completely automated and does not impact operations
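
A network of identifier types connected by edges suggests that a conversion between two types can be found as a path through the graph. The sketch below illustrates that idea with a breadth-first search over a tiny, invented edge list; it does not reproduce bioDBnet's actual network, API or algorithm.

```python
from collections import deque


def conversion_path(edges: list, source: str, target: str):
    """Return the shortest chain of identifier types from source to target, or None."""
    graph = {}
    for a, b in edges:                       # treat each edge as bidirectional
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented edges between identifier types, for illustration only.
edges = [
    ("Gene Symbol", "Gene ID"),
    ("Gene ID", "UniProt Accession"),
    ("Gene ID", "Ensembl Gene ID"),
    ("UniProt Accession", "PDB ID"),
]

path = conversion_path(edges, "Gene Symbol", "PDB ID")
print(" -> ".join(path))
```

Each hop in the returned path would correspond to one lookup table between two databases, which is why adding a single new edge can make many new conversions reachable.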

  38. Integration layers
     ● The first layer connects measurements through gene associations.
     ● The second layer recognizes feedback, interactions and network complexities, and builds on top of the first.

  39. Java Web Start Version Overview
     ● WPS: an in-house pathway analysis, visualization and data integration tool
     ● Tertiary analysis for NGS
     ● To WPS, NGS data is just another source of data from different platforms, in parallel with metabolome, microarray, proteomics data etc.
     ● Migration: NextGen WPS for NGS data and other high throughput data
     ● Whole pathway scope

  40. SLEPR and Pathway-level Pattern Extraction (PPEP)
     ● SLEPR: Yi and Stephens, PLoS ONE 2008, 3(9):e3288
     ● PPEP: Yi M, Mudunuri U, Che A, Stephens R, BMC Bioinformatics 2009, 10:200
     ● WPS: Yi et al, BMC Bioinformatics 2006, 7:30

  41. BioCarta Pathways Uncovered by SLEPR but not by the Conventional Method, from a Breast Cancer Array Dataset (NCI)

  42. BioCarta Pathway: Spliceosomal Assembly

  43. GBrowse

  44. IGV

  45. Computational Integration in Biomarker Discovery: Testing and validation include mechanistic studies in mice and biomarker validation in the clinic.

  46. Computational Integration across species. Analyze mouse or clinical data featuring selection and modeling. Network signatures, molecular signatures and candidate biomarkers are calculated.

  47. SysBioCube

  48. Bioinformatics Directions/Growth
     ● Data visualization: required to pull together and interpret the huge volumes of data now being produced.
     ● Data integration: signs of disease can often be diagnosed at different levels, requiring the "big picture" to be drawn for full understanding.
     ● Natural Language Processing: there are simply too many papers for humans to do all of the reading and comprehension.
     ● Controlled vocabularies: allow apples to be compared with apples.

  49. Clinical Sequencing and Cancer
     ● Companies are already offering cancer diagnostic panels
     ● Tied to proprietary in-house clinical variation databases
     ● Connect sets of mutations to clinically actionable treatments and/or trials
     ● Identify likely responders/non-responders
     ● Issues: ethics, counseling, non-actionable targets

  50. Training in bioinformatics?
     ● Skill set needs to encompass aspects of both biological science and computer science.
     ● Direct access to relevant scientific questions through own research or close ties to the scientific community.
     ● Ability to adapt to new questions, applications and data types.
