Bioinformatics in Cancer Biotechnology

Presentation Transcript

  1. Bioinformatics in Cancer Biotechnology
     Bob Stephens
     Advanced Biomedical Computing Center
     Advanced Technology Program
     SAIC-Frederick, Inc.
     National Cancer Institute at Frederick
     April 19, 2007

  2. Objectives
     • Introduce bioinformatics concepts, applications, and databases.
     • Describe the interplay between bioinformatics, technologies, and the web.
     • Profile the importance of bioinformatics in cancer research.
     Cancer Biotechnology Series

  3. What is bioinformatics?
     • Bioinformatics is the application of computational methods to the analysis of any type of biological data.
     • Bioinformatics has become a diverse, multidisciplinary field that originally derived from computer science and biological science.

  4. Evolution of bioinformatics
     • Rapid technological advances in sequence determination set the pace for data acquisition.
     • Similar advances in computing power, algorithmic approaches for sequence analysis, and robotics-enabled instruments.
     • Co-evolution with web browser and programming language technologies.

  5. Bioinformatics evolution (contd.)
     • Additional high-throughput technologies are becoming available almost daily: microarrays, proteomics, population and genetic data, medical literature, etc.
     • Data volume is increasing at the same time as data complexity.
     • Data distribution and synchronization are becoming increasingly difficult.

  6. Interplay between technology and bioinformatics
     • New high-throughput (HT) technologies, e.g. mRNA microarrays
     • Analysis and storage software
     • Computational infrastructure
     • Data integration

  7. Example
     • mRNA expression chip (20,000 genes x 16 probes per gene): a few MB per sample.
     • Data normalization software.
     • Exon array, with multiple probes for each exon of each of the 20,000 genes: one file is about 1 GB.
     • New normalization method requires all samples to be loaded simultaneously.
     • More complex analysis reveals alternative splicing, etc.
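The remark that a newer normalization method needs all samples loaded at once can be illustrated with quantile normalization. The slide does not name the method, so this choice is an assumption; the function name and the toy intensities are hypothetical. The sketch shows why every sample must be in memory together: the reference distribution is the per-rank mean across all samples.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of probe intensities each).

    Illustrative sketch only: ties are broken by probe order, which real
    implementations usually handle by averaging.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # The reference value for rank k is the mean of the k-th smallest
    # intensity over ALL samples, so every sample is needed at once.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # two samples, three probes
    print(quantile_normalize(toy))
```

With 20,000 genes x 16 probes, a sample holds 320,000 probe values, so memory use grows linearly with the number of samples that must be held together.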

  8. Interface of technologies and biology
     • Experimental design is very important in HT biology.
     • Experiments are shaped by data access and availability.
     • Re-analysis of old data with new methods is important.

  10. Bioinformatics - historical perspective
     • Stage 1: the term bioinformatics is coined to describe what had been DNA and protein sequence analysis (ca. 1995).
     • Stage 2: additional disciplines are rolled into bioinformatics, including literature mining, statistical analysis, and virtually anything to do with computational analysis of biological data (ca. 2000).

  11. Bioinformatics - historical perspective (contd.)
     • Realization that bioinformatics is too broad a term; other disciplines break away, e.g. the "omics" fields (genomics, proteomics, and others) (ca. 2001).
     • Later (and currently), realization that we won't be able to make sense of the individual disciplines without integrating them; the term changes to integrative biology or systems biology (ca. 2003).

  12. Importance of bioinformatics
     • Bioinformatics has become a major part of both the NCI 2015 directive and the NIH Roadmaps.
     • It is virtually impossible to perform biological research without some form of computer-aided analysis, especially in areas like genomics and proteomics.
     • It is important to keep the scientific community in touch with developing technologies and capabilities for the highest return on research investment.

  13. Bioinformatics infrastructures
     • Command-line implementations.
     • Primitive GUI implementations.
     • Sophisticated GUI interfaces and application packaging.
     • Web interfaces and the Java language give platform-independent access.
     • PC-based, web-based, and server-based architectures.
     • Multi-tier infrastructures distribute the computational burden.

  14. What does bioinformatics technology involve?
     • A computer-readable form of one or more types of biological data (instruments).
     • Automation also requires programmable robotics capabilities (process science).
     • Computer infrastructure for storing and analyzing the data.
     • As data volume and complexity grow, the dependency on computer analysis increases.

  15. Sources of bioinformatics technology
     • Technologies leveraged from computer science, including algorithms, data representation models, visualization frameworks, and programming languages.
     • Technologies leveraged from the web industry, including communication protocols, web servers, and secure access.
     • Connectivity and technologies derived from the database industry.
     • Robotics and process engineering technologies for faster, cheaper throughput.

  16. What can bioinformatics technology do for biological science?
     • Develop uniform data standards and controlled vocabularies to allow integration of disparate sources and types of data.
     • Connect scientists to the entire wealth of knowledge, from basic science results to clinical trial data, in a context-sensitive manner.
     • Fully integrate the worldwide volume of knowledge, for example patient information (disease -> treatment -> outcome) across multiple centers, to allow cross-comparisons.

  17. NCI Resources
     • caBIG and NCICB initiatives to develop an integrated data/tool environment.
     • Long-term project requiring unprecedented cooperation and sharing.
     • Short-term solutions for day-to-day problems.
     • Solution: use multiple approaches, staged implementation, and layered technologies.

  19. ABCC hardware
     • 128-CPU Linux cluster (3.0 GHz processors).
     • 256-CPU Linux SMP box with 1 TB memory.
     • 64-CPU IRIX SMP box with 256 GB memory.
     • 32-CPU IBM AIX SMP computers.
     • 16-CPU IBM HPC AIX SMP computer.
     • 8 x 8-CPU IRIX computers.
     • Other miscellaneous computers, disk storage, tape backup, and network connectivity.
     • Graphics visualization wall.

  21. ABCC Organization
     • Networking and Security
     • System administration
     • Scientific program development
     • Bioinformatics support
     • Staff ~ 40

  22. ABCC Training Programs
     • Classes for NIH/NCI scientists:
     • Unix, GCG, Java, High throughput sequence analysis, Geospiza (LIMS)
     • Eudora, Advanced Eudora, Webmail
     • Homology, Docking, QSAR, Intro to Modeling, Phred, Phrap, Consed
     • One-on-one consulting services and training.
     • Organize and host vendor-specific training in genomics, pathways, and modeling.

  23. ABCC Support within ATP
     • Proteomics and Analytical Technologies (LPAT)
     • Computational Support
     • Database Tools/Pathways
     • Mass Storage and Archive
     • Pattern Analysis and Clustering
     • Molecular Technologies (LMT)
     • Image Analysis (IAL)
     • Computational Support
     • Database Tools and LIMS
     • Mass Storage and Archive
     • Bioinformatics/Web
     • Pattern/SNP Analysis
     • ABCC
     • Algorithm and Software
     • Image Database
     • Mass Storage and Archive
     • Viz Technology Development
     • Gene Expression (GEL)
     • Protein Chemistry (PCL)
     • Software Support
     • Gene Assembly and Validation
     • Protein Expression (PEL)
     • Animal Sciences (LASP)
     • Mass Storage
     • Database
     • POET/Web

  24. ABCC applications
     • Sequence analysis - protein and nucleic acid, GCG and EMBOSS.
     • Sequence assembly, SNP detection.
     • Gene finders, analysis tools.
     • Molecular modeling, docking.
     • Molecular evolution and phylogeny.
     • Computational chemistry.
     • Linkage analysis.
     • Proteomics.
     • Classification tools (microarray and proteomics).

  25. ABCC databases
     • GenBank and derived divisions.
     • RefSeq, WGS, UniGene divisions.
     • dbSNP, Gene, OMIM, HomoloGene.
     • UCSC, EBI, and NCBI genome datasets.
     • LIMS systems, data management.
     • UniProt, PDB, PIR, iProClass, Swiss-Prot.
     • CGAP, MGC data files, pathways.
     • MEDLINE, TRANSFAC, and repeats data files.

  26. ABCC web resources
     • ABCC General information web page
     • ABCC account application information
     • ABCC Training web page
     • ABCC scientific applications web page
     • ABCC GRID Database web page http://grid.abcc/
     • ABCC Pipelines web page

  27. The role of bioinformatics in cancer research
     • Diagnosis - identify classifiers to better subdivide cancer etiologies into groups; use better individual data to match treatment to the individual.
     • Treatment - identify better methods to track treatment progress and flag problems earlier.
     • Prevention - understand the mechanisms of cancer initiation, progression, and development, and identify targets in this process.
     • Connect data from geographically distributed cancer patients for more complete analysis.

  28. Protein analysis tools
     • Protein composition, isoelectric point, molecular weight analysis tools.
     • Comparable alignment/searching tools for proteins.
     • Protein secondary structure prediction tools.
     • Protein structure modeling tools.
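As a minimal sketch of the first bullet, composition and molecular weight can be computed directly from a one-letter protein sequence. The function names are hypothetical, and the table holds standard average residue masses in daltons; isoelectric point is left out because it depends on the pKa set chosen.

```python
from collections import Counter

# Average residue masses (Da) for the 20 standard amino acids,
# i.e. the free amino acid mass minus one water.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # added once for the free N- and C-termini


def composition(sequence):
    """Return a dict mapping each residue letter to its count."""
    return dict(Counter(sequence.upper()))


def molecular_weight(sequence):
    """Return the average molecular weight (Da) of an unmodified peptide."""
    seq = sequence.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


if __name__ == "__main__":
    peptide = "GAVLK"  # toy sequence for illustration
    print(composition(peptide))
    print(round(molecular_weight(peptide), 2))
```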

  29. Genomics tools
     • Gene finder and general genome annotation tools.
     • Cross-genome comparison tools and databases.
     • Large-scale sequence assembly and polymorphism identification tools.
     • Genomic visualization tools (UCSC, NCBI, Ensembl).
     • Data cleansing tools - vector screening, repeat masking.

  30. Gene expression tools
     • EST clustering and differential expression analysis tools and databases.
     • SAGE analysis tools and databases.
     • Microarray data collection, calibration, and analysis tools and databases.
     • Gene clustering and visualization tools.
     • Integration tools - pathways, regulatory networks, and medical literature.
     • Databases for housing and querying the data.
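Gene clustering tools typically start from a similarity measure between expression profiles. The sketch below uses Pearson correlation, which is a common choice but an assumption here since the slide names no measure; the gene names, values, and the `min_r` cutoff are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def coexpressed_pairs(profiles, min_r=0.9):
    """Return (gene1, gene2, r) for every pair with correlation >= min_r."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if r >= min_r:
                pairs.append((g1, g2, r))
    return pairs


if __name__ == "__main__":
    toy_profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed_pairs(toy_profiles))
```

A full clustering method (hierarchical, k-means, and so on) would build on a pairwise measure like this one; the all-pairs loop is quadratic in the number of genes, so it only suits small gene sets.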

  31. Proteomics tools
     • Mass spectrometry tools for peptide identification.
     • Fragment classification tools for identification of diagnostics.
     • Peptide fragment resolution tools - identification of protein mixtures from peptide sets.
     • Databases for storing and querying the data.

  32. Inherent bioinformatics problems
     • Keeping data sources synchronized and up to date.
     • Keeping applications up to date.
     • Remaining aware of the current palette of available tools and resources.
     • Separation between computer developers and biologist users of software and databases.
     • The silo concept - separate, dysfunctional units.
     • Lack of a common language or database schema.

  33. Data Analysis
     • Pathway analysis
     • Polymorphism
     • Proteomics
     • Image analysis
     • Homology Modeling
     • Live polymorphism analysis (if time permits)

  34. Pathway Analysis
     • Identify the specific requirements of an individual tumor.
     • Advance to detection from diagnosis.
     • Multiple points can cause aberrations, and multiple points can be acted on to correct them.
     • Identify and characterize tissue- and cell-specific targets.

  35. Pathway Gene Set Analysis
     • Many experiments result in sets of genes, e.g. microarray, proteomics, literature searches, etc.
     • Clustering genes based on expression etc. provides only the first dimension.
     • View prospective pathways impacted by changes in expression, protein levels, phosphorylation, etc.
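One common way to ask whether a pathway is impacted by an experimental gene set is an over-representation test. The slides do not say which statistic the ABCC tools use, so the hypergeometric test below is an assumed stand-in, and the function names and gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, pathway_size, sample_size, overlap):
    """P(X >= overlap) when drawing sample_size genes without replacement
    from a population containing pathway_size pathway genes."""
    upper = min(pathway_size, sample_size)
    numerator = sum(
        comb(pathway_size, i) * comb(population - pathway_size, sample_size - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(population, sample_size)


def pathway_enrichment(gene_set, pathway_genes, background_genes):
    """Return (overlap, p_value) for one gene set against one pathway."""
    background = set(background_genes)
    hits = set(gene_set) & background
    pathway = set(pathway_genes) & background
    overlap = len(hits & pathway)
    p_value = hypergeom_pvalue(len(background), len(pathway), len(hits), overlap)
    return overlap, p_value


if __name__ == "__main__":
    background = [f"g{i}" for i in range(10)]   # toy 10-gene universe
    pathway = ["g0", "g1", "g2", "g3", "g4"]    # toy pathway membership
    experiment = ["g0", "g1"]                   # toy differentially expressed set
    print(pathway_enrichment(experiment, pathway, background))
```

In practice the test is repeated for every pathway in a database, and the resulting p-values need a multiple-testing correction.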

  36. [Figure; sample labels: G5G8Tg1Liver, G5G8Tg2Liver, G5G8-/-1Liver, G5G8-/-2Liver, G5G8-/-3Liver]

  37. [Figure; sample labels: G5G8Tg1Liver, G5G8Tg2Liver, G5G8-/-1Liver, G5G8-/-2Liver, G5G8-/-3Liver]

  38. Integrative Strategy for Microarray Analysis
     • Microarray Data
     • Clustering Analysis
     • Load into WPS
     • WSCP
     • Unassigned Genes
     • Integrate with WPS
     • Lists of Genes
     • Assign to uncharacterized pathway(s)
     • Assign to known pathway(s)
     • Putative Pathway
     • PSCP
     • PSCP
     • PSCP

  39. Project Goal: Integrate Biological Data and/or Information Databases into Biological Networks
     • User input: Microarray Data, Proteomics
     • Protein Interaction Database (BIND, DIP, etc.)
     • Comparative Genomics
     • P1, P2
     • Protein Modification (Phos., Glyco.)
     • Gene regulation (Promoter etc.)
     • Gene Ontology
     • SNP & Haplotype Database (SNPinfo etc.)
     • Literature DB (e.g. Pubgene, ResNet)
     • NCBI resources, OMIM etc. ……
     • Statistical Evaluation
     • Network Expansion (high, low confidence)

  40. One example of an analysis scenario
     • Microarray data
     • Pathway analysis or clustering on a local PC
     • Candidate gene sets
     • Candidate pathway sets
     • Pre-computed DBs or run-time computed
     • Internet-enabled
     • SNP & Haplotype data (SNPinfo; Disease association; Promoter Comparison: 1. CGI generator, 2. CoreSearch, 3. ConsInspector)
     • Protein interaction
     • Literature-based (Pubgene etc., NCBI OMIM etc.)
     • GO
     • Known gene training
     • Weighted scoring (statistical analysis, filtering)
     • Final set of candidate genes (visualization and re-creation of the new subnetwork within the whole network)
     • Pathway expansion
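The weighted scoring and filtering step in this scenario can be sketched as a weighted sum over evidence sources followed by a cutoff. The slide gives no formula, so the linear scoring rule, the weights, the threshold, and all names below are hypothetical illustrations only.

```python
def rank_candidates(evidence, weights, threshold):
    """Score each gene as a weighted sum of its evidence values and keep
    genes whose score reaches the threshold, best first.

    evidence: {gene: {source: value}}, with values assumed to lie in [0, 1]
    weights:  {source: weight}; sources without a weight contribute nothing
    """
    scored = []
    for gene, sources in evidence.items():
        score = sum(weights.get(src, 0.0) * val for src, val in sources.items())
        if score >= threshold:
            scored.append((gene, score))
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical weights for three of the evidence types on the slide.
    weights = {"snp": 0.5, "interaction": 0.3, "literature": 0.2}
    evidence = {
        "GENE1": {"snp": 1.0, "interaction": 1.0},
        "GENE2": {"literature": 1.0},
        "GENE3": {"snp": 1.0, "literature": 1.0},
    }
    for gene, score in rank_candidates(evidence, weights, threshold=0.5):
        print(gene, round(score, 2))
```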

  41. Polymorphism Impacts
     • Variation within a species can be as great as the differences between closely related species.
     • Confounds correlation analysis.
     • Impacts gene structure and expression.
     • Start with the complete sequence for an individual, then obtain polymorphism data for populations, strains, breeds, etc.
     • Strains and breeds allow for a good start.

  42. Polymorphism Types
     • SNPs
     • Indels
     • STRs
     • Tandem
     • NonTandem (Copy number variation)
     • Retroelement
     • Complex
     • Inversion/translocation
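Of the types listed, short tandem repeats (STRs) are simple to detect in a single sequence. The sketch below is a regular-expression scan written for illustration, not the method behind the ABCC polymorphism views; the function name, the default unit lengths, and the repeat threshold are assumptions.

```python
import re


def find_strs(sequence, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    Only exact repeats are found; interrupted or degenerate repeats would
    need an alignment-based method instead.
    """
    # A lazily matched unit of min_unit..max_unit bases, followed by at
    # least (min_repeats - 1) further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    toy = "TTGACACACACAGG"  # contains (AC)4 starting at position 3
    print(find_strs(toy))
```

Comparing the repeat count at the same locus across strains or individuals is what turns a detected repeat into a polymorphism call.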

  43. STR Polymorphism View

  44. Strain Trace and Contig Coverage View

  45. InDel Polymorphism Information View

  46. Location Polymorphism Locator Query

  47. STR Query results

  48. Polymorphism Visualization

  49. Proteomics Initiative: ABCC Projects
     • Disk Storage and Archiving (centralized storage)
     • LAN Support
     • Software Development
     • Spectral Filtering
     • Clustering/Biomarker Identification
     • Database Development and Update
     • Peptide identification DB
     • MS Integration with Pathways
     • ABCC Pathway tool
     • Provide Scalable Computational Resources
     • Software Optimization
     • Sequest (working with LPAT, Yates Lab, and Thermoelectron)

  50. [Figure; panel labels: Raw Data, Binning, Biological Marker, Clustering]
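The binning step named on this slide can be sketched for mass spectra as summing peak intensities into fixed-width m/z bins, so that spectra from different samples share the same features before clustering. The fixed bin width, the function name, and the toy peaks are assumptions; the slide does not describe the actual binning scheme.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins.

    Returns {bin_index: total_intensity}, where bin_index * bin_width is
    the lower m/z edge of the bin.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // bin_width)] += intensity
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    toy_peaks = [(100.2, 5.0), (100.7, 3.0), (101.1, 2.0)]
    print(bin_spectrum(toy_peaks, bin_width=1.0))
```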