
ILRI/BECA Bioinformatics Platform Introduction

ILRI/BECA Bioinformatics Platform Introduction. Etienne de Villiers ILRI - Kenya. Outline. ILRI/BECA Bioinformatics Platform Hardware Specialized software: Database searching Assembly software CGIAR Bioinformatics Grid. International Livestock Research Institute.



Presentation Transcript


  1. ILRI/BECA Bioinformatics Platform Introduction Etienne de Villiers ILRI - Kenya

  2. Outline • ILRI/BECA Bioinformatics Platform • Hardware • Specialized software: • Database searching • Assembly software • CGIAR Bioinformatics Grid

  3. International Livestock Research Institute A lab in Africa at the foot of Kenya’s Ngong Hills

  4. ILRI Research Objectives • Overall mandate is livestock research for poverty alleviation in Africa and South East Asia. • Undertakes a balance of fundamental and applied research with long, medium and short term objectives. • Livestock health, genetics, and management.

  5. ILRI Facilities • State-of-the-art laboratories (2500 m²) • Large and small animal facilities • Level-2/3 biosafety facility for cattle and sheep • Bioinformatics unit • 64 CPU Paracel 64-bit HPC cluster • Sequencing unit • ABI 3730 and ABI 3100 • Microarray facility • Proteomics facility • Oligonucleotide synthesis unit • FACS analysis facility • Tick unit

  6. BECA - Biosciences East and Central Africa • Under NEPAD, several centers of excellence are being established in Africa. • One center is being established at ILRI – Biosciences East and Central Africa (BECA). • The center will provide state-of-the-art facilities for scientists in the region. • Facilities include: • Genetics and Genomics lab with high-throughput sequencers • Microarray laboratory • Proteomics laboratory • Immunology and molecular biology laboratories • Bioinformatics Platform

  7. ILRI/BECA – Bioinformatics Platform • Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support. • EMBNet node for East and Central Africa

  8. IBBP services • Access to bioinformatics tools through either: • web-based bioinformatics tools through the BBP website • secure shell (ssh) access for registered users • Facilities for storage of large datasets • Systems administration and backup of datasets • Training and support in the use of BBP resources • Graduate and Post-graduate Fellowships in Bioinformatics

  9. IBBP Facilities • Training room • 18 computers with MS Windows and Linux • High-speed internet connection • Servers • 66 CPU Beowulf Linux cluster • High-availability Web server

  10. IBBP Website www.becabioinfo.org

  11. Selection of available tools on IBBP • Paracel Blast • GeneMatcher2 • PTA • Oligocheck • EMBOSS 200+ bioinformatics tools • ClustalW multiple alignment software • T-Coffee multiple alignment software • FastA sequence alignment tool • HMMER multiple alignment and sequence searching software • Staden sequence assembly and analysis package • Primer3 primer design package • PAUP tree-inference package • PHYLIP tree-inference package • Phred/Phrap DNA editing and assembly tools • R statistical package • Rosetta – ab initio protein prediction • SRS – sequence retrieval tool • etc.

  12. IBBP Hardware Systems • HPC Linux cluster: 66 CPUs (AMD 64-bit), 72 Gigabyte RAM, 3 Terabyte disk storage • Paracel Blast Machine • Parallel NCBI-Blast (20 CPUs) • Blast • PSI-Blast • Mega-Blast • GeneMatcher2 • 6144 CPU supercomputer • HMM • Smith-Waterman • GeneWise • Profile

  13. Linux cluster • Rocks 4.1 (RedHat) operating system • Platform LSF batch queuing • shares resources equally between users • MPI libraries • Parallel computations [Diagram: cluster software stack; labels as extracted: Application Software (e.g. BLAST, EMBOSS, Rosetta); Application Integration; Middleware (Platform LSF); Batch Queue Setup; Operating System (Red Hat - ROCKS); Turnkey HPC Integration; Nodes; Cluster Build and Configuration; Network (GigE)]
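
The slide does not say how work is divided among cluster nodes. As a hedged illustration of the "parallel computations" bullet, the Python sketch below shows one common pattern: a multi-sequence FASTA query is split into batches that could each be submitted as a separate batch job. The function names `parse_fasta` and `split_round_robin` are hypothetical and are not part of LSF, MPI, or any IBBP tool.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_round_robin(records, n_jobs):
    """Deal records out to n_jobs batches, one at a time (hypothetical helper)."""
    batches = [[] for _ in range(n_jobs)]
    for index, record in enumerate(records):
        batches[index % n_jobs].append(record)
    return batches


if __name__ == "__main__":
    queries = parse_fasta(">q1\nACGT\n>q2\nTTGA\n>q3\nGGCC\n")
    for job_id, batch in enumerate(split_round_robin(queries, 2)):
        print(job_id, [name for name, _ in batch])
```

Round-robin dealing keeps batch sizes within one record of each other; when sequence lengths vary widely, balancing by total bases per batch may be a better choice.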

  14. Database searching • Heuristic Algorithms (FASTA and BLAST) • Gapped BLAST • Traditional ungapped BLAST • Are fast but give approximate alignments • Dynamic Programming Algorithms • Global – Needleman-Wunsch • Local – Smith-Waterman • Give optimal alignment but are very slow
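
To make the dynamic programming bullet concrete, here is a minimal Python sketch of Smith-Waterman local alignment scoring. It assumes a simple linear gap penalty and made-up scores (`match=2`, `mismatch=-1`, `gap=-2`); real searches use substitution matrices and affine gaps. It is an illustration only, not the implementation used by any tool named in this presentation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions. Only two rows of the
    dynamic programming matrix are kept, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                           # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the len(a) × len(b) matrix is filled, so run time grows with the product of the two sequence lengths. That is the reason the slide calls these methods optimal but very slow compared with heuristics that examine only promising regions.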

  15. Paracel Blast Server • Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems • 20 CPU parallel NCBI-Blast • 20x faster than NCBI-Blast server • Benchmark: Blastn, Paracel Blast vs. NCBI Blast • Query: Chromosome 8, 1 sequence, 150,000,000 bases • Database: Human Ref. Seq, 10,300 sequences, 24,300,000 bases • Paracel Blast: 1h 9m 56s • NCBI: 6 days 2h 20m 34s
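
The two run times quoted on this slide can be turned into a speedup factor with a few lines of Python. This is plain arithmetic on the slide's own numbers; the slide does not say how its headline "20x" figure was derived, and for this particular benchmark the quoted timings work out to a larger ratio.

```python
from datetime import timedelta

# Run times exactly as quoted on the slide.
paracel_blast = timedelta(hours=1, minutes=9, seconds=56)
ncbi_blast = timedelta(days=6, hours=2, minutes=20, seconds=34)

# Dividing one timedelta by another yields a plain float ratio.
speedup = ncbi_blast / paracel_blast

if __name__ == "__main__":
    print(f"Paracel Blast: {paracel_blast.total_seconds():.0f} s")
    print(f"NCBI Blast:    {ncbi_blast.total_seconds():.0f} s")
    print(f"Speedup:       {speedup:.1f}x")
```

With these inputs the ratio is roughly 126-fold (526,834 s against 4,196 s).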

  16. BioView Viewer Paracel Blast Server

  17. BioView Viewer

  18. Gene Structure Determination • To compare a cDNA or EST database to a genomic database, one must allow for introns • Two approaches: • Double-affine Smith-Waterman (separate gap penalty for introns) • GeneWise – protein or HMM versus genomic DNA (models the important features of protein families better)
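
The idea behind a "separate gap penalty for introns" can be sketched as a gap cost that is the cheaper of two affine functions: one for short indels and one for long, intron-like gaps. The Python below is a minimal sketch; all four penalty values are hypothetical and are not GeneMatcher2 settings.

```python
def double_affine_gap_cost(
    length,
    short_open=10.0,     # hypothetical: opening an ordinary indel
    short_extend=2.0,    # hypothetical: each base of an ordinary indel
    intron_open=40.0,    # hypothetical: opening an intron-like gap
    intron_extend=0.05,  # hypothetical: each base of an intron-like gap
):
    """Cost of a gap of `length` bases under a double-affine scheme.

    The aligner effectively pays whichever model is cheaper, so short
    gaps are priced as indels and long gaps are priced as introns.
    """
    if length <= 0:
        return 0.0
    as_indel = short_open + short_extend * length
    as_intron = intron_open + intron_extend * length
    return min(as_indel, as_intron)


if __name__ == "__main__":
    for gap_length in (1, 5, 20, 1000):
        print(gap_length, double_affine_gap_cost(gap_length))
```

With these made-up values a 1000-base gap costs far less than the 2010 it would cost under the short-gap model alone, which is what lets a cDNA or EST align across an intron in genomic DNA without the alignment being broken.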

  19. How to get more distant homologs • Use dynamic programming algorithms • Use position-specific or HMM profiles • Do iterated searches • Use translated searches • Must be careful in interpretation (statistics)
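
As an illustration of the "translated searches" bullet, the sketch below translates a DNA sequence in all six reading frames using the standard genetic code, which is the first step of TBLASTX-style comparisons. The function names are hypothetical, only unambiguous A/C/G/T input is handled properly, and codons containing any other character are reported as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Comparing at the protein level tolerates synonymous DNA changes, which is why translated searches can recover more distant homologs, at the price of six times as many sequences to compare.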

  20. GeneMatcher2 • Do things you either can’t or wouldn’t attempt at NCBI (100x faster) • Is a computer specialized for executing calculation-intensive methods in bioinformatics: • Especially fast at performing the very sensitive Smith-Waterman pairwise alignment method • compensates for frameshifts • GeneWise • intron- and frameshift-tolerant search method • Needleman-Wunsch alignments • HMM searches • 6,144 parallel processor computer

  21. Why GeneMatcher2? • Comparison of sensitivity and selectivity of various sequence search methods • Blue denotes a software method • Yellow denotes a hardware-accelerated method [Figure: search methods plotted by selectivity and sensitivity; axis labels: "Fewer false positives", "More true positives"]

  22. GeneMatcher2 - Performance • Time-to-completion comparison of original methods and methods on GeneMatcher2 • TBLASTX improvement is 20-fold • Other methods at least 100-fold [Bar chart: "Runtime for an average query"; y-axis: Seconds (0 to 1000); x-axis: Method; bar labels as extracted: NCBI TBLASTX EBI GeneWise Paracel TBLASTX Decypher HMM Paracel GeneWise Decypher TBLASTX WUSTL HMM cluster GeneMatcher2 SW FASTA Smith-Waterman] Source: Genome Canada Bioinformatics Platform Project

  23. BioView Viewer BioView Workbench

  24. BioView Viewer

  25. Assembly Software • Paracel Transcript Assembler (PTA) • High-capacity solution for EST-based transcript reconstruction • Can assemble large numbers of ESTs, allowing for splice variants • Complete pipeline for: sequence cleaning, clustering and assembly • Detection, alignment and visualization of alternative splice forms • Visualization through intuitive graphical interfaces

  26. Scientific problems for PTA • Proteomics • Gene discovery • Verify gene predictions for genome assembly • Detecting splice variants • Patterns of expression, tissue specificity • SNP detection • Combinations of all the above...

  27. PTA – Contig view

  28. PTA – Splice variant alignment

  29. Paracel Oligocheck • Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2 • Searches oligos quickly against a whole genome • Software used by companies that design and synthesize oligonucleotides, e.g. MWG

  30. Ensembl mirror • Ensembl is a joint project between EMBL - EBI and the Sanger Institute. • A software system which produces and maintains automatic annotation on selected eukaryotic genomes. • Our site provides free access to selected areas of the data and software from the Ensembl project.

  31. CGIAR – HPC GRID computing [Diagram: grid sites ILRI Kenya, ICRISAT India, IRRI Philippines, CIP Peru and BECA/Partners; labels as extracted: 33 nodes, GeneMatcher2, 4 nodes, 8 nodes, 4 nodes; 49 nodes, 89 CPUs]

  32. Thank you
