
AMDeC Bioinformatics Core Facility at Columbia University



    1. AMDeC Bioinformatics Core Facility at Columbia University An Introduction

    2. Outline of Talk: Who are we? The facility (hardware, software, databases, staff); search accelerators; genome and transcript assembly software; examples of projects; GeneWays

    5. AMDeC Strategy: Establish Collaborative Big Science Research Projects; Development of New York's Research Infrastructure; Establish a Competitive Business Advantage in Biomedical Research; Advance Public Understanding of and Participation in Biomedical Research; Grow New York's Biotechnology Sector

    6. AMDeC's Five-Year Strategic Initiative for Human Genetics Research in New York State Development of Core Research Facilities Genetics Gene Expression Proteomics Bioinformatics Technology Development Facilitating Access to, and Analysis of, Clinical Populations Training and Grants to Support Scientists

    8. Who are we for? Sophisticated users (bioinformaticians): we provide the tools, specialized hardware, capacity, memory, and parallel processing to run your analyses. Mid-level users: we provide the tools and experts to help you carry out state-of-the-art genetic/genomic analysis. Novices: we can help get you started, tell you where to look, and perhaps even suggest new directions for your laboratory.

    9. HARDWARE Paracel accelerators: GeneMatcher2 (fast dynamic programming searches); BlastMachine (fast BLAST searches). Other: general-purpose front-end servers, Sun Sparc V880s (8 CPU production and 2 CPU testing); Beowulf2 production cluster (Linux, 90 CPU) for parallel MPI and standard serial jobs, plus a smaller testing cluster; file servers, Oracle and MySQL servers, download machine, and backup servers.

    11. SOFTWARE Commercial packages (Paracel): GenomeAssembler, TranscriptAssembler, Filtering Package. Academic/open source packages: sequence analysis (BioPerl, EMBOSS, clustalW, HMMER-2.2g, MUMmer, T-COFFEE); phylogenetic analysis (PAML, PHYLIP v3.5c); statistical analysis (R, Bioconductor, SJava, SOLAR); molecular mechanics (X-PLOR). System software, relational database software.

    12. DATABASES Most standard nucleotide and peptide databases: NCBI (GenBank), EMBL (IPI), TIGR (ESTs), Wash U (Pfam). Many complete genome sequences (over 50 bacteria, sea squirt, Fugu, Tetraodon, C. elegans). Human (repeats, Golden Path and Ensembl assemblies). Mouse trace database. Human/mouse chromosome assemblies in 100 kb fragments with 10 kb overlaps.
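The slide does not say how the 100 kb fragments with 10 kb overlaps are produced. As a minimal sketch, assuming that "overlap" means each fragment shares its last 10 kb with the start of the next one, the coordinates could be generated like this (the function name `overlapping_fragments` and the half-open coordinate convention are hypothetical, chosen for illustration):

```python
def overlapping_fragments(length, size=100_000, overlap=10_000):
    """Yield half-open (start, end) coordinates covering [0, length).

    Assumption: consecutive fragments share `overlap` bases, so each new
    fragment starts `size - overlap` bases after the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while start < length:
        end = min(start + size, length)
        yield start, end
        if end == length:  # the last fragment reached the chromosome end
            break
        start += step


# Example: a hypothetical 250 kb sequence
fragments = list(overlapping_fragments(250_000))
```

The overlap means that a hit spanning a fragment boundary is still fully contained in at least one fragment, provided the hit is shorter than the overlap.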

    13. Expert Staff Full-time: Kenneth Smith, PhD, Proj Mgr (computational biology/software development, genetics databases); Hans-Eric Aronson, MPhil, Sys Admin (software/user support, X-ray crystallography); Pavel Morozov, PhD, Mathematical Biologist/Programmer (consulting, collaborations, evolutionary biology, gene families); Xiaoqing Zhang, MS, Programmer Analyst (scientific programming, Web, database design and programming); Stuart Fischer, PhD, Sr Biologist (bioinformatics algorithm development, new data source/new product evaluation and integration, consultation with users, physical mapping, automated SNP evaluation). Part-time: Barry Allen, PhD, Sr Sys Admin (computing systems design/admin); Kristen McFadden, BA, Systems Admin (hardware/operating systems); James J. Russo, PhD, Sr Biologist (consultation and education, sequencing); Mitzi Morris, MS (software design, natural language processing).

    14. Why use the Core Facility? Do large-scale projects: large BLAST jobs on BlastMachine (we maintain updated dbs); search for protein domains/new genes using HMMs/Genewise on GeneMatcher2; find remote homologs using SW on GeneMatcher2 or TBLASTX/PSI-BLAST on BlastMachine; batch jobs (e.g., MegaBlast); setting up pipelines, serial jobs; sophisticated cDNA/genomic alignment and EST assembly with TranscriptAssembler, and contig/scaffold assembly with GenomeAssembler. One-stop shopping: we maintain links to most important public software. Consulting/collaboration: advice on use of hardware/software; problem solving; hosting and development of large-scale projects.

    15. DATABASE SEARCHING Dynamic programming algorithms: global (Needleman-Wunsch) and local (Smith-Waterman); they give the optimal alignment but are very slow. Heuristic algorithms (FASTA and BLAST): gapped BLAST and traditional ungapped BLAST; they are fast but give approximate alignments.
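To make the cost of dynamic programming concrete, here is a minimal Smith-Waterman sketch. It is a textbook version with a linear gap penalty, not the Paracel implementation, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions. Every cell of the score matrix is filled, which is why the method is optimal but takes time proportional to the product of the two sequence lengths.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Textbook formulation with a linear gap penalty; the scoring values
    are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix: each cell is the best local score ending at (i, j).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTTACGTAAA", "GGACGTCC")
```

Needleman-Wunsch differs mainly in dropping the `0` floor and tracing back from the bottom-right corner, so that the whole of both sequences is aligned.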

    16. How to get more distant homologs: use dynamic programming algorithms; use position-specific or HMM profiles; do iterated searches; use translated searches. Must be careful in interpretation (statistics).
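As an illustration of the position-specific profile idea, the sketch below builds a simple log-odds scoring matrix from a small ungapped alignment and scans a target with it. This is a simplified stand-in, not the Gribskov or HMM methods themselves: it assumes a uniform background, a fixed pseudocount, and no gaps, and it uses a toy DNA alignment for brevity. All function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build per-column log2-odds scores from equal-length aligned sequences.

    Assumptions: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for a sequence as long as the profile."""
    return sum(col[r] for col, r in zip(pssm, seq))


def best_hit(pssm, target):
    """Return (score, offset) of the best ungapped window in target.

    The target must be at least as long as the profile.
    """
    width = len(pssm)
    return max((score(pssm, target[i:i + width]), i)
               for i in range(len(target) - width + 1))


# Toy alignment of four related sites (made-up data for illustration)
profile = build_pssm(["TATAAT", "TATAAT", "TACAAT", "TATGAT"])
hit_score, hit_offset = best_hit(profile, "GGCGTATAATGCC")
```

Because each column carries its own scores, a conserved position contributes more to the total than a variable one, which is what lets profiles detect more distant family members than a single query sequence. The caution about statistics on the slide still applies: a raw score like this says nothing about significance on its own.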

    18. Gene Structure To compare a cDNA or EST database to a genomic database, one must allow introns. Two approaches: (1) double-affine Smith-Waterman (separate gap penalty for introns); (2) Genewise, protein or HMM versus genomic DNA (better models the important features of the protein family).
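The slide does not give the exact double-affine formulation used on the GeneMatcher2. A common way to express the idea, shown here as a sketch under that assumption, is a two-piece affine gap cost: one (open, extend) pair tuned for short indels and a second pair with a high opening cost but very cheap extension, so that an intron-sized gap is not penalized in proportion to its length. The parameter values are hypothetical.

```python
def affine_gap(length, open_penalty, extend_penalty):
    """Cost of a gap of `length` positions under a single affine model."""
    return open_penalty + extend_penalty * length


def double_affine_gap(length, short=(120, 20), long=(400, 1)):
    """Two-piece affine gap cost: the cheaper of a 'short indel' model
    and an 'intron' model. Parameters are (open, extend) pairs and are
    illustrative assumptions, not values from the slides.
    """
    if length <= 0:
        return 0
    return min(affine_gap(length, *short), affine_gap(length, *long))


small_indel = double_affine_gap(3)      # priced by the short-gap model
intron_like = double_affine_gap(1000)   # priced by the long-gap model
```

With these example values the two models cross over at about 15 positions: shorter gaps are scored as ordinary indels, and longer ones fall under the intron model.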

    19. The Paracel Machines BlastMachine: do anything that you can do at NCBI (~20x faster); batch jobs; take advantage of PSI- and PHI-BLAST; search against a 6-frame translation of the entire database; 22 dual-processor nodes. GeneMatcher2: do things you either can't or wouldn't attempt at NCBI (100x faster); DP searches (Smith-Waterman; Needleman-Wunsch); profile searches; HMM searches; GeneWise intron- and frameshift-tolerant searches; 9216 parallel processing cells; ASIC technology.
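For readers unfamiliar with the term, a 6-frame translation reads a nucleotide sequence in three forward and three reverse-complement reading frames. The sketch below shows the idea using the standard genetic code; it is not how BlastMachine implements the search, and the function names and frame labels (`+1` to `-3`) are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping frame label to protein string."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Searching translated frames lets a protein query find coding regions in unannotated nucleotide data, at the cost of six times as many target sequences to compare.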

    20. SEARCH TOOLS AVAILABLE BLAST algorithms (moderate sensitivity, high specificity): BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, MEGABLAST. Smith-Waterman algorithms (high sensitivity and specificity): SWN (linear, affine, and double affine gap penalties); SWP (affine and double affine); SWX (framesearch, reverse framesearch). Profile algorithms (high sensitivity/specificity, distant homology): Gribskov profile search and protein frame search; HMM and HMM frame search; GeneWise, PSI-BLAST, PHI-BLAST.

    22. ASSEMBLY SOFTWARE GenomeAssembler: good for assembly of up to at least microbial genomes; uses forward-reverse constraints. TranscriptAssembler: can assemble large numbers of ESTs, allowing for splice variants.

    37. Some Project Examples: EST screening; gene discovery/SNP identification; genome annotation; gene evolutionary analysis

    38. EST Screening Paracel TranscriptAssembler (PTA) software was used by John Edwards in Dr. Jingyue Ju's laboratory to cluster and assemble 35,000 sequences from more than 20,000 Aplysia cDNA EST clones, producing about 11,000 non-redundant sequences. These sequences were then screened using TBLASTX on the BlastMachine against the nt (nonredundant nucleotide) database. This gave more informative results than searching against the protein database.

    40. Mouse Genomic-Reads Screening for SNP Identification Dr. Stuart Fischer, collaborating with Dr. Rudy Leibel, screened 1000 transcripts in 7 genetically defined intervals against the mouse genomic reads database (>41 million reads) and other sources (4 strains total); the screen takes about 15 hours on the BlastMachine. Multiple sequence alignments were built from genomic reads using Paracel GenomeAssembler (PGA). Strain-specific differences were revealed by Calypso (PGA package) and in-house software. Amino acid substitutions that may cause important structural changes are being predicted using the PrISM software package developed by Dr. An-Suei Yang of the CGC.

    46. Legionella Genome Annotation Dr. James Russo's lab in the CGC is sequencing the genome of L. pneumophila, the bacterium responsible for Legionnaires' disease. Every few weeks all contigs and unassembled sequences are searched against various NCBI and local databases; this is an overnight run on the BlastMachine. PGA is being used as an adjunct to phrap, to take advantage of its ability to use pairwise constraints.

    48. Epilepsy Gene Evolutionary Analysis Dr. Pavel Morozov of the Facility staff collaborated with Drs. Ruth Ottman, Conrad Gilliam and Sergey Kalachikov of the Columbia Genome Center to characterize a new gene family (LGI), one member of which (LGI1) causes a rare form of epilepsy. The BlastMachine and GeneMatcher2 (Smith-Waterman, HMMER) were used intensively to search for distant homologs. Comparison of transcribed sequences from genomic regions of about 10 Mb around the LGI family members was performed using the BlastMachine.

    50. Other Projects Bacterial Enzyme Family Screening: thousands of iterative HMM searches for potential RNA-related enzymes in a bacterial genome, using RNA binding domains and known enzyme families. The goal is to discover new family members. Anopheles gambiae Genome Analysis: Dr. Andrey Rzhetsky conducted an analysis of four gene families and their exon/intron structure in the newly sequenced genome of Anopheles gambiae using the HMMER and Genewise algorithms on the GeneMatcher2.

    51. GeneWays Parsing literature to reveal gene interaction pathways. The ontology consists of molecular actions (e.g., activate, acetylate, express, methylate), including many subcategories and synonyms. The entire database (currently containing 2.7 million statements, 1.5 million unique, and 700,000 substances) is available for search with any term in the ontology. Filtering options reduce the magnitude of output.

    58. How to reach us Project website (http://amdec-bioinfo.cu-genome.org). Project manager (new accounts; new projects): Kenneth Smith, kcs3@columbia.edu. Operational problems: Hans-Eric Aronson, hga1@columbia.edu. Other scientific questions: Pavel Morozov, pm259@columbia.edu; Stuart Fischer, sgf2@columbia.edu; Jim Russo, jjr4@columbia.edu.
