AMDeC Bioinformatics Core Facility at Columbia University

1. AMDeC Bioinformatics Core Facility at Columbia University: An Introduction

2. Outline of Talk: who are we; the facility (hardware, software, databases, staff); search accelerators; genome and transcript assembly software; examples of projects; GeneWays.

5. AMDeC Strategy: Establish collaborative "Big Science" research projects; development of New York's research infrastructure; establish a competitive business advantage in biomedical research; advance public understanding of and participation in biomedical research; grow New York's biotechnology sector.

6. AMDeC's Five-Year Strategic Initiative for Human Genetics Research in New York State: Development of Core Research Facilities (Genetics, Gene Expression, Proteomics, Bioinformatics); Technology Development; Facilitating Access to, and Analysis of, Clinical Populations; Training and Grants to Support Scientists.

8. Who are we for? Sophisticated users (bioinformaticians): we provide the tools, specialized hardware, capacity, memory, and parallel processing to run your analyses. Mid-level users: we provide the tools and experts to help you carry out state-of-the-art genetic/genomic analysis. Novices: we can help get you started, tell you where to look, and perhaps even suggest new directions for your laboratory.

9. HARDWARE Paracel accelerators: GeneMatcher2 (fast dynamic programming searches); BlastMachine (fast BLAST searches). Other: general-purpose front-end servers: Sun Sparc V880s (8 CPU production and 2 CPU testing); Beowulf2 production cluster (Linux, 90 CPU) for parallel MPI and standard serial jobs, plus a smaller testing cluster; file servers, Oracle and MySQL servers, download machine, and backup servers.

11. SOFTWARE Commercial packages (Paracel): GenomeAssembler, TranscriptAssembler, Filtering Package. Academic/open source packages: sequence analysis (BioPerl, EMBOSS, ClustalW, HMMER-2.2g, MUMmer, T-COFFEE); phylogenetic analysis (PAML, PHYLIP v3.5c); statistical analysis (R, Bioconductor, SJava, SOLAR); molecular mechanics (X-PLOR). System software, relational database software.

12. DATABASES Most standard nucleotide and peptide databases: NCBI (GenBank), EMBL (IPI), TIGR (ESTs), Wash U (Pfam). Many complete genome sequences (over 50 bacteria, sea squirt, Fugu, Tetraodon, C. elegans). Human (repeats, Golden Path and Ensembl assemblies). Mouse trace database. Human/mouse chromosome assemblies in 100 kb fragments with 10 kb overlaps.
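The slide does not say how the 100 kb fragments with 10 kb overlaps are generated. A minimal Python sketch of one plausible tiling is shown here, assuming each fragment is 100 kb in total and consecutive fragments share 10 kb; the function name `tile_sequence` and the half-open coordinate convention are illustrative choices, not the facility's actual tooling.

```python
def tile_sequence(seq_len, fragment=100_000, overlap=10_000):
    """Yield (start, end) half-open coordinates of overlapping fragments.

    Assumption: each fragment is `fragment` bases long (the last one may be
    shorter) and shares `overlap` bases with the fragment before it.
    """
    if overlap >= fragment:
        raise ValueError("overlap must be smaller than the fragment length")
    if seq_len <= 0:
        return
    step = fragment - overlap  # distance between consecutive fragment starts
    start = 0
    while True:
        end = min(start + fragment, seq_len)
        yield start, end
        if end == seq_len:  # the last fragment reaches the end of the sequence
            break
        start += step


if __name__ == "__main__":
    # A hypothetical 250 kb chromosome is covered by three fragments.
    for start, end in tile_sequence(250_000):
        print(start, end)
```

Overlapping the fragments means a match that straddles a fragment boundary still appears intact in at least one fragment, provided it is shorter than the overlap.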

13. Expert Staff Full-time: Kenneth Smith, PhD, Proj Mgr (computational biology/software development, genetics databases); Hans-Eric Aronson, MPhil, Sys Admin (software/user support, X-ray crystallography); Pavel Morozov, PhD, Mathematical Biologist/Programmer (consulting, collaborations, evolutionary biology, gene families); Xiaoqing Zhang, MS, Programmer Analyst (scientific programming, Web, database design and programming); Stuart Fischer, PhD, Sr Biologist (bioinformatics algorithm development, new data source/new product evaluation and integration, consultation with users, physical mapping, automated SNP evaluation). Part-time: Barry Allen, PhD, Sr Sys Admin (computing systems design/admin); Kristen McFadden, BA, Systems Admin (hardware/operating systems); James J. Russo, PhD, Sr Biologist (consultation and education, sequencing); Mitzi Morris, MS (software design, natural language processing).

14. Why use the Core Facility? Do large-scale projects: large BLAST jobs on BlastMachine (we maintain updated databases); search for protein domains/new genes using HMMs/Genewise on GeneMatcher2; find remote homologs using Smith-Waterman (SW) on GeneMatcher2 or TBLASTX/PSI-BLAST on BlastMachine; batch jobs (e.g., MegaBlast); setting up pipelines, serial jobs; sophisticated cDNA/genomic alignment and EST assembly with TranscriptAssembler and contig/scaffold assembly with GenomeAssembler. One-stop shopping: we maintain links to most important public software. Consulting/collaboration: advice on use of hardware/software; problem solving; hosting and development of large-scale projects.

15. DATABASE SEARCHING Dynamic programming algorithms (global: Needleman-Wunsch; local: Smith-Waterman) give the optimal alignment but are very slow. Heuristic algorithms (FASTA and BLAST; gapped BLAST, traditional ungapped BLAST) are fast but give approximate alignments.
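To make the cost of dynamic programming concrete, the following is a minimal Python sketch of Smith-Waterman local alignment scoring. It uses a simple linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are illustrative assumptions rather than the parameters used on the GeneMatcher2. It returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Every cell of a (len(a)+1) x (len(b)+1) matrix is filled, so the running
    time grows with len(a) * len(b). Only two rows are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTGG", "CCACGTCC"))
```

Because every query residue is compared with every database residue, the work scales with the product of the two lengths, which is the reason heuristics such as BLAST or dedicated hardware are used for large databases.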

16. How to get more distant homologs: use dynamic programming algorithms; use position-specific or HMM profiles; do iterated searches; use translated searches. Must be careful in interpretation (statistics).

18. Gene Structure To compare a cDNA or EST database to a genomic database, one must allow introns. Two approaches: double-affine Smith-Waterman (separate gap penalty for introns); Genewise: protein or HMM versus genomic DNA (better models the important features of the protein family).
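One common way to express a separate gap penalty for introns is to take the cheaper of two affine gap costs: one tuned for short indels and one with a high opening cost but a very low extension cost for long, intron-like gaps. The Python sketch below illustrates that idea only. The parameter values and the convention that a gap of length L costs open + extend * L are assumptions, not the scoring scheme of the Paracel implementation.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap of `length` positions under an affine penalty."""
    return gap_open + gap_extend * length


def double_affine_gap_cost(length,
                           short_open=100, short_extend=20,
                           long_open=400, long_extend=1):
    """Cost of a gap under a double-affine penalty (hypothetical parameters).

    Short gaps are charged by the ordinary indel penalty; long gaps switch to
    the intron penalty, which is expensive to open but cheap to extend.
    """
    if length <= 0:
        return 0
    return min(
        affine_gap_cost(length, short_open, short_extend),
        affine_gap_cost(length, long_open, long_extend),
    )


if __name__ == "__main__":
    for gap_length in (3, 15, 16, 1000):
        print(gap_length, double_affine_gap_cost(gap_length))
```

With these illustrative numbers the intron penalty becomes the cheaper option for gaps of 16 positions or more, so a 1000-base intron costs far less than it would under the ordinary indel penalty.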

19. The Paracel Machines BlastMachine: do anything that you can do at NCBI (~20x faster); batch jobs; take advantage of PSI- and PHI-BLAST; search against a 6-frame translation of the entire database; 22 dual-processor nodes. GeneMatcher2: do things you either can't or wouldn't attempt at NCBI (100x faster); DP searches (Smith-Waterman; Needleman-Wunsch); profile searches; HMM searches; GeneWise intron- and frameshift-tolerant searches; 9216 parallel processing cells (ASIC technology).
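A 6-frame translation converts a nucleotide sequence into protein in three reading frames on each strand, which is what translated searches such as TBLASTX compare. The Python sketch below shows the idea using the standard genetic code; the function names and the "X" placeholder for codons containing ambiguous bases are illustrative choices, not part of any Paracel or NCBI interface.

```python
# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Searching all six frames makes it possible to detect protein-level similarity in unannotated DNA, at the cost of roughly six times as many comparisons per sequence.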

20. SEARCH TOOLS AVAILABLE BLAST algorithms (moderate sensitivity, high specificity): BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, MEGABLAST. Smith-Waterman algorithms (high sensitivity and specificity): SWN (linear, affine, and double affine gap penalties); SWP (affine and double affine); SWX (framesearch, reverse framesearch). Profile algorithms (high sensitivity/specificity, distant homology): Gribskov profile search and protein frame search; HMM and HMM frame search; GeneWise, PSI-BLAST, PHI-BLAST.

22. ASSEMBLY SOFTWARE GenomeAssembler: good for assembling genomes up to at least microbial size; uses forward-reverse constraints. TranscriptAssembler: can assemble large numbers of ESTs, allowing for splice variants.

37. Some Project Examples: EST screening; gene discovery/SNP identification; genome annotation; gene evolutionary analysis.

38. EST Screening Paracel TranscriptAssembler (PTA) software was used by John Edwards in Dr. Jingyue Ju's laboratory to cluster and assemble 35,000 sequences from more than 20,000 Aplysia cDNA EST clones, producing about 11,000 non-redundant sequences. These sequences were then screened using TBLASTX on the BlastMachine against the nt (nonredundant nucleotide) database. This gave more informative results than searching against the protein database.

40. Mouse Genomic-Reads Screening for SNP Identification Dr. Stuart Fischer, collaborating with Dr. Rudy Leibel, screened 1000 transcripts in 7 genetically defined intervals against the mouse genomic reads database (>41 million reads) and other sources (4 strains total); this takes about 15 hours on the BlastMachine. Multiple sequence alignments are built from the genomic reads using Paracel GenomeAssembler (PGA). Strain-specific differences are revealed by Calypso (PGA package) and in-house software. Amino acid substitutions that potentially cause important structural changes are being predicted using the PrISM software package developed by Dr. An-Suei Yang of the CGC.

46. Legionella Genome Annotation Dr. James Russo's lab in the CGC is sequencing the genome of L. pneumophila, the bacterium responsible for Legionnaires' disease. Every few weeks all contigs and unassembled sequences are searched against various NCBI and local databases in an overnight run on the BlastMachine. PGA is being used as an adjunct to phrap, to take advantage of its ability to use pairwise constraints.

48. Epilepsy Gene Evolutionary Analysis Dr. Pavel Morozov of the Facility staff collaborated with Drs. Ruth Ottman, Conrad Gilliam and Sergey Kalachikov of the Columbia Genome Center to characterize a new gene family (LGI), one member of which (LGI1) causes a rare form of epilepsy. The BlastMachine and GeneMatcher2 (Smith-Waterman, HMMER) were used intensively to search for distant homologs. Comparison of transcribed sequences from genomic regions of about 10 Mb around the LGI family members was performed using the BlastMachine.

50. Other Projects Bacterial enzyme family screening: thousands of iterative HMM searches for potential RNA-related enzymes in a bacterial genome using RNA binding domains and known enzyme families. The goal is to discover new family members. Anopheles gambiae genome analysis: Dr. Andrey Rzhetsky conducted an analysis of four gene families and their exon/intron structure in the newly sequenced genome of Anopheles gambiae using the HMMER and Genewise algorithms on the GeneMatcher2.

51. GeneWays: Parsing literature to reveal gene interaction pathways. The ontology consists of molecular actions (e.g., activate, acetylate, express, methylate, etc.), including many subcategories and synonyms. The entire database (currently containing 2.7 million statements, 1.5 million unique, and 700,000 substances) is available for search with any term in the ontology. Filtering options reduce the magnitude of the output.

58. How to reach us Project website: http://amdec-bioinfo.cu-genome.org. Project Manager (new accounts; new projects): Kenneth Smith, kcs3@columbia.edu. Operational problems: Hans-Eric Aronson, hga1@columbia.edu. Other scientific questions: Pavel Morozov, pm259@columbia.edu; Stuart Fischer, sgf2@columbia.edu; Jim Russo, jjr4@columbia.edu.
