The ILRI/BECA Bioinformatics Platform, located at the International Livestock Research Institute in Kenya, provides advanced resources for livestock research aimed at poverty alleviation in Africa and Southeast Asia. With state-of-the-art laboratories and high-performance computing capabilities, the platform offers bioinformatics tools, large-volume data storage, and training for East and Central African scientists. Key facilities include genetics labs, sequencing units, and support for applications like genomic analysis and data management, fostering collaborative research to enhance livestock health, genetics, and management.
ILRI/BECA Bioinformatics Platform: Introduction Etienne de Villiers, ILRI - Kenya
Outline • ILRI/BECA Bioinformatics Platform • Hardware • Specialized software: • Database searching • Assembly software • CGIAR Bioinformatics Grid
International Livestock Research Institute A lab in Africa at the foot of Kenya’s Ngong Hills
ILRI Research Objectives • Overall mandate is livestock research for poverty alleviation in Africa and Southeast Asia. • Undertakes a balance of fundamental and applied research with long-, medium- and short-term objectives. • Research covers livestock health, genetics, and management.
ILRI Facilities • State-of-the-art laboratories (2,500 m²) • Large and small animal facilities • Level-2/3 biosafety facility for cattle and sheep • Bioinformatics unit • 64 CPU Paracel 64-bit HPC cluster • Sequencing unit • ABI 3730 and ABI 3100 • Microarray facility • Proteomics facility • Oligonucleotide synthesis unit • FACS analysis facility • Tick unit
BECA - Biosciences East and Central Africa • Under NEPAD, several centers of excellence are being established in Africa. • One center is being established at ILRI: Biosciences East and Central Africa (BECA). • The center will provide state-of-the-art facilities for scientists in the region. • Facilities include: • Genetics and Genomics lab with high-throughput sequencers • Microarray laboratory • Proteomics laboratory • Immunology and molecular biology laboratories • Bioinformatics Platform
ILRI/BECA – Bioinformatics Platform • Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support. • EMBnet node for East and Central Africa
IBBP services • Access to bioinformatics tools through either: • web-based bioinformatics tools through the BBP website • secure shell (ssh) access for registered users • Facilities for storage of large datasets • Systems administration and backup of datasets • Training and support in the use of BBP resources • Graduate and Post-graduate Fellowships in Bioinformatics
IBBP Facilities • Training room • 18 computers with MS Windows and Linux • High-speed internet connection • Servers • 66 CPU Beowulf Linux cluster • High-availability Web server
IBBP Website www.becabioinfo.org
Selection of available tools on IBBP • Paracel Blast • GeneMatcher2 • PTA • Oligocheck • EMBOSS 200+ bioinformatics tools • ClustalW multiple alignment software • T-Coffee multiple alignment software • FastA sequence alignment tool • HMMER multiple alignment and sequence searching software • Staden sequence assembly and analysis package • Primer3 primer design package • PAUP tree-inference package • Phylip tree-inference package • Phred/Phrap DNA editing and assembly tools • R statistical package • Rosetta – ab initio protein structure prediction • SRS – sequence retrieval tool • etc.
IBBP Hardware Systems • HPC Linux cluster • 66 CPUs (AMD 64-bit) • 72 Gigabyte RAM • 3 Terabyte disk storage • Paracel Blast Machine • Parallel NCBI-Blast (20 CPUs) • Blast • PSI-Blast • Mega-Blast • GeneMatcher2 • 6,144 CPU supercomputer • HMM • Smith-Waterman • GeneWise • Profile
Linux cluster • Rocks 4.1 (RedHat) operating system • Platform LSF batch queuing • shares resources equally between users • MPI libraries • Parallel computations [Diagram: cluster stack, labels in order: Application Software (e.g. BLAST, EMBOSS, Rosetta); Application Integration; Middleware (Platform LSF); Batch Queue Setup; Operating System (Red Hat - ROCKS); Turnkey HPC Integration; Nodes; Cluster Build and Configuration; Network (GigE)]
Database searching • Heuristic Algorithms (FASTA and BLAST) • Gapped BLAST • Traditional ungapped BLAST • Are fast but give approximate alignments • Dynamic Programming Algorithms • Global – Needleman-Wunsch • Local – Smith-Waterman • Give optimal alignment but are very slow
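The dynamic programming approach named on this slide can be sketched in a few lines. The following is a minimal, unoptimized illustration of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not the settings used on GeneMatcher2 or any other IBBP tool.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Illustrative scoring only: linear gap penalty, fixed match/mismatch scores.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAACGTAAA", "CCCGTCC"))
```

The nested loops fill a table with one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That is why the slide calls these methods optimal but very slow compared with the BLAST and FASTA heuristics.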
Paracel Blast Server • Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems • 20 CPU parallel NCBI-Blast • 20x faster than NCBI-Blast server • Benchmark: Blastn, Paracel Blast vs. NCBI Blast • Query: Chromosome 8 (1 sequence, 150,000,000 bases) • Database: Human Ref. Seq (10,300 sequences, 24,300,000 bases) • Paracel Blast: 1h 9m 56s • NCBI: 6 days 2h 20m 34s
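As a worked check of the benchmark on this slide, the two run times can be converted to seconds and divided. The sketch below uses only the two durations quoted above; the variable names are illustrative.

```python
from datetime import timedelta

# Run times quoted on the slide for the Chromosome 8 Blastn benchmark.
paracel = timedelta(hours=1, minutes=9, seconds=56)
ncbi = timedelta(days=6, hours=2, minutes=20, seconds=34)

# Dividing two timedeltas yields a plain float ratio.
speedup = ncbi / paracel

if __name__ == "__main__":
    print(f"Paracel Blast: {paracel.total_seconds():.0f} s")
    print(f"NCBI Blast:    {ncbi.total_seconds():.0f} s")
    print(f"Speed-up:      {speedup:.1f}x")
```

For this one query the ratio comes to roughly 125-fold, which is larger than the 20x headline figure. The slide does not say which workload the 20x figure refers to.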
BioView Viewer Paracel Blast Server
Gene Structure Determination • To compare a cDNA or EST database to a genomic database, one must allow for introns • Two approaches: • Double-affine Smith-Waterman (separate gap penalty for introns) • GeneWise – protein or HMM versus genomic DNA (models the important features of protein families better)
How to get more distant homologs • Use dynamic programming algorithms • Use position-specific or HMM profiles • Do iterated searches • Use translated searches • Must be careful in interpretation (statistics)
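Translated searches compare sequences at the protein level, which requires translating the DNA in all six reading frames (three per strand). The sketch below shows only that translation step, assuming the standard genetic code. The function names are made up for this example and do not correspond to any tool on the platform.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

A translated search then aligns each of the six protein strings against the database, so it does roughly six times the work per query sequence (and more when the database is translated as well, as in TBLASTX).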
GeneMatcher2 • Do things you either can’t or wouldn’t attempt at NCBI (100x faster) • A computer specialized for executing calculation-intensive methods in bioinformatics: • Especially fast in performing the very sensitive Smith-Waterman pairwise alignment method • compensates for frame shifts • GeneWise • intron- and frameshift-tolerant search method • Needleman-Wunsch alignments • HMM searches • 6,144 parallel processor computer
Why GeneMatcher2? • Comparison of sensitivity and selectivity of various sequence search methods • Blue denotes a software method • Yellow denotes a hardware-accelerated method [Figure: axes labeled "fewer false positives" and "more true positives"]
GeneMatcher2 - Performance • Time-to-completion comparison of original methods and methods on GeneMatcher2 • TBLASTX improvement is 20-fold • Other methods at least 100-fold [Figure: bar chart, "Runtime for an average query"; y-axis: seconds; x-axis: method, including NCBI TBLASTX, EBI GeneWise, Paracel TBLASTX, Decypher HMM, Paracel GeneWise, Decypher TBLASTX, WUSTL HMM cluster, GeneMatcher2 SW and FASTA Smith-Waterman] Source: Genome Canada Bioinformatics Platform Project
BioView Viewer BioView Workbench
Assembly Software • Paracel Transcript Assembler (PTA) • High-capacity solution for EST-based transcript reconstruction • Can assemble large numbers of ESTs, allowing for splice variants • Complete pipeline for sequence cleaning, clustering and assembly • Detection, alignment and visualization of alternative splice forms • Visualization through intuitive graphical interfaces
Scientific problems for PTA • Proteomics • Gene discovery • Verify gene predictions for genome assembly • Detecting splice variants • Patterns of expression, tissue specificity • SNP detection • Combinations of all the above...
Paracel Oligocheck • Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2 • Searches oligos quickly against a whole genome • Software used by companies designing and synthesizing oligonucleotides, e.g. MWG
Ensembl mirror • Ensembl is a joint project between EMBL-EBI and the Sanger Institute. • It is a software system that produces and maintains automatic annotation on selected eukaryotic genomes. • Our site provides free access to selected areas of the data and software from the Ensembl project.
CGIAR – HPC GRID computing [Diagram: grid sites ILRI Kenya, ICRISAT India, IRRI Philippines, CIP Peru and BECA/Partners; labels shown, in order: 33 nodes, GeneMatcher2, 4 nodes, 49 nodes, 89 CPUs, 8 nodes, 4 nodes]