1. AMDeC Bioinformatics Core Facility at Columbia University: An Introduction
2. Outline of Talk
Who are we?
The facility (hardware, software, databases, staff)
Search accelerators
Genome and transcript assembly software
Examples of projects
GeneWays
5. AMDeC Strategy
Establish Collaborative Big Science Research Projects
Development of New York's research infrastructure
Establish a Competitive Business Advantage in Biomedical Research
Advance Public Understanding of and Participation in Biomedical Research
Grow New York's Biotechnology Sector
6. AMDeC's Five-Year Strategic Initiative for Human Genetics Research in New York State
Development of Core Research Facilities
Genetics
Gene Expression
Proteomics
Bioinformatics
Technology Development
Facilitating Access to, and Analysis of, Clinical Populations
Training and Grants to Support Scientists
8. Who are we for?
Sophisticated users (bioinformaticians): we provide the tools, specialized hardware, capacity, memory, and parallel processing to run your analyses
Mid-level users: we provide the tools and experts to help you carry out state-of-the-art genetic/genomic analysis
Novices: we can help get you started, tell you where to look, and perhaps even suggest new directions for your laboratory
9. HARDWARE
Paracel accelerators
GeneMatcher2: fast dynamic programming searches
BlastMachine: fast BLAST searches
Other
General purpose front end servers
Sun Sparc V880s (8 CPU production and 2 CPU testing)
Beowulf2 production cluster (Linux, 90 CPU): parallel MPI and standard serial jobs; also a smaller testing cluster
File servers, Oracle and MySQL servers, download machine and backup servers
11. SOFTWARE
Commercial Packages (Paracel)
GenomeAssembler
TranscriptAssembler
Filtering Package
Academic/Open Source Packages
Sequence Analysis (BioPerl, EMBOSS, ClustalW, HMMER-2.2g, MUMmer, T-COFFEE)
Phylogenetic Analysis (PAML, PHYLIP v3.5c)
Statistical Analysis (R, Bioconductor, SJava, SOLAR)
Molecular Mechanics (X-PLOR)
System Software, Relational Database Software
12. DATABASES
Most standard nucleotide and peptide databases
NCBI (GenBank), EMBL (IPI), TIGR (ESTs), Wash U (Pfam)
Many complete genome sequences
Over 50 bacteria, sea squirt, Fugu, Tetraodon, C. elegans
Human (repeats, Golden Path and Ensembl assemblies)
Mouse trace database
Human/mouse chromosome assemblies in 100 kb fragments with 10 kb overlaps
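The fragmenting scheme in the last bullet can be sketched as a simple tiling routine. This is only an illustration, not the facility's actual script: the function name `tile_windows` is hypothetical, and it assumes each fragment starts 90 kb after the previous one, so that neighbouring fragments share 10 kb.

```python
def tile_windows(length, size=100_000, overlap=10_000):
    """Return half-open (start, end) windows that cover [0, length).

    Assumption: consecutive windows advance by size - overlap bases,
    so neighbours share `overlap` bases. The last window may be shorter.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while start < length:
        end = min(start + size, length)
        windows.append((start, end))
        if end == length:
            break  # the chromosome end has been reached
        start += step
    return windows


if __name__ == "__main__":
    # A toy 250 kb "chromosome" yields three overlapping fragments.
    for start, end in tile_windows(250_000):
        print(start, end)
```

The overlap means a hit that straddles a fragment boundary still falls entirely inside at least one fragment, provided the hit is shorter than the overlap.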
13. Expert Staff
Full-Time
Kenneth Smith, PhD, Project Manager: computational biology/software development, genetics databases
Hans-Eric Aronson, MPhil, Systems Administrator: software/user support, X-ray crystallography
Pavel Morozov, PhD, Mathematical Biologist/Programmer: consulting, collaborations, evolutionary biology, gene families
Xiaoqing Zhang, MS, Programmer Analyst: scientific programming, Web, database design and programming
Stuart Fischer, PhD, Senior Biologist: bioinformatics algorithm development, new data source/new product evaluation and integration, consultation with users, physical mapping, automated SNP evaluation
Part-Time
Barry Allen, PhD, Senior Systems Administrator: computing systems design/administration
Kristen McFadden, BA, Systems Administrator: hardware/operating systems
James J. Russo, PhD, Senior Biologist: consultation and education, sequencing
Mitzi Morris, MS: software design, natural language processing
14. Why use the Core Facility?
Do large-scale projects
Large BLAST jobs on BlastMachine; we maintain updated databases
Search for protein domains/new genes using HMMs/GeneWise on GeneMatcher2
Find remote homologs using Smith-Waterman on GeneMatcher2 or TBLASTX/PSI-BLAST on BlastMachine
Batch jobs (e.g., MegaBlast)
Setting up pipelines, serial jobs
Sophisticated cDNA/genomic alignment and EST assembly with TranscriptAssembler and contig/scaffold assembly with GenomeAssembler
One-stop shopping: we maintain links to most important public software
Consulting/Collaboration
Advice on use of hardware/software
Problem solving
Hosting and development of large-scale projects
15. DATABASE SEARCHING
Dynamic Programming Algorithms
Global: Needleman-Wunsch
Local: Smith-Waterman
Give optimal alignment but are very slow
Heuristic Algorithms (FASTA and BLAST)
Gapped BLAST
Traditional ungapped BLAST
Are fast but give approximate alignments
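To make the dynamic programming bullets concrete, here is a minimal sketch that scores the optimal alignment of two sequences in either local (Smith-Waterman) or global (Needleman-Wunsch) mode. The function name `align_score` and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration; production searches use substitution matrices and affine gap penalties.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=True):
    """Score the optimal alignment of a and b with a linear gap penalty.

    local=True  -> Smith-Waterman (best-scoring pair of substrings)
    local=False -> Needleman-Wunsch (both sequences aligned end to end)
    Only two rows of the matrix are kept, so memory is O(len(b)).
    """
    # Row 0: zeros for local mode, accumulated gap penalties for global mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(align_score("TTACGTTT", "GGACGTGG"))       # local score
    print(align_score("ACGT", "AGT", local=False))   # global score
```

Every cell of the len(a) x len(b) matrix is filled, which is why these methods give the optimal score but are slow on large databases; heuristics such as BLAST avoid most of that work.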
16. How to get more distant homologs
Use dynamic programming algorithms
Use position-specific or HMM profiles
Do iterated searches
Use translated searches
Must be careful in interpretation (statistics)
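A translated search compares sequences at the protein level by translating nucleotide sequence in all six reading frames. The sketch below shows only that translation step, using the standard genetic code; the names `translate` and `six_frame` are hypothetical, and real tools such as TBLASTX perform the translation internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame("ATGGCCTAA").items():
        print(name, protein)
```

Because each query or database sequence becomes six protein sequences, translated searches cost several times more than a plain nucleotide search, which is one reason hardware acceleration helps.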
18. Gene Structure
To compare a cDNA or EST database to a genomic database, one must allow introns.
Two approaches:
Double-affine Smith-Waterman (separate gap penalty for introns)
GeneWise: protein or HMM versus genomic DNA (better models the important features of the protein family)
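The idea behind a double-affine gap penalty can be illustrated with a small cost function: short gaps (ordinary indels) and long gaps (introns) are charged under two different affine schemes, and the cheaper one applies. This is a sketch of the general idea only; the function name and all parameter values are hypothetical, and the exact formulation used by the Paracel software is not given in these slides.

```python
def double_affine_gap_cost(length, short_open=10, short_extend=2,
                           long_open=40, long_extend=1):
    """Cost of a gap of `length` bases under a double-affine scheme.

    Two affine penalties are evaluated and the cheaper one is charged:
      short regime: low opening cost, steep extension (ordinary indels)
      long regime:  high opening cost, shallow extension (introns)
    All parameter values here are illustrative assumptions.
    """
    if length <= 0:
        return 0
    short_cost = short_open + short_extend * length
    long_cost = long_open + long_extend * length
    return min(short_cost, long_cost)


if __name__ == "__main__":
    for n in (1, 5, 30, 1000):
        print(n, double_affine_gap_cost(n))
```

With these example values the two regimes cross at a gap length of 30, so longer gaps grow more slowly in cost than a single affine penalty would allow, which keeps intron-sized gaps from overwhelming the alignment score.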
19. The Paracel Machines
BlastMachine
Do anything that you can do at NCBI (~20x faster)
Batch jobs
Take advantage of PSI- and PHI-Blast
Search against a 6-frame translation of the entire database
22 dual processor nodes
GeneMatcher2
Do things you either can't or wouldn't attempt at NCBI (100x faster)
DP searches (Smith-Waterman; Needleman-Wunsch)
Profile searches
HMM searches
GeneWise intron- and frameshift-tolerant searches
9216 parallel processing cells (ASIC technology)
20. SEARCH TOOLS AVAILABLE
BLAST Algorithms (moderate sensitivity, high specificity)
BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, MEGABLAST
Smith-Waterman Algorithms (high sensitivity and specificity)
SWN (linear, affine, and double affine gap penalties)
SWP (affine and double affine)
SWX (framesearch, reverse framesearch)
Profile Algorithms (high sensitivity and specificity for distant homology)
Gribskov profile search and protein frame search
HMM and HMM frame search
GeneWise, PSI-BLAST, PHI-BLAST
22. ASSEMBLY SOFTWARE
GenomeAssembler
Good for assembling genomes up to at least microbial size
Uses forward-reverse constraints
TranscriptAssembler
Can assemble large numbers of ESTs, allowing for splice variants
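A forward-reverse constraint comes from sequencing both ends of the same clone: in a correct assembly the two reads should face each other on opposite strands, separated by roughly the clone's insert size. The sketch below checks one such pair. It is a generic illustration, not Paracel's implementation; the `Placement` class, the function name, and the insert-size bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Where a read sits in a contig (hypothetical minimal record)."""
    start: int      # leftmost contig coordinate of the read
    end: int        # rightmost contig coordinate (exclusive)
    forward: bool   # True if the read aligns to the contig's forward strand


def satisfies_constraint(a, b, min_insert, max_insert):
    """Check a forward-reverse (mate-pair) constraint for two placed reads.

    The left read must be on the forward strand, the right read on the
    reverse strand, and the distance from the start of the left read to
    the end of the right read must fall within the expected insert size.
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    if not (left.forward and not right.forward):
        return False  # reads do not point toward each other
    span = right.end - left.start
    return min_insert <= span <= max_insert


if __name__ == "__main__":
    fwd = Placement(100, 600, True)
    rev = Placement(2500, 3000, False)
    print(satisfies_constraint(fwd, rev, 2000, 4000))
```

An assembler can use violated pairs as evidence of a misjoin and satisfied pairs that span two contigs as evidence for ordering them into a scaffold.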
37. Some Project Examples
EST screening
Gene discovery/SNP identification
Genome annotation
Gene evolutionary analysis
38. EST Screening
Paracel TranscriptAssembler (PTA) software was used by John Edwards in Dr. Jingyue Ju's laboratory to cluster and assemble 35,000 sequences from more than 20,000 Aplysia cDNA EST clones.
This produced about 11,000 non-redundant sequences.
These sequences were then screened using TBLASTX on the BlastMachine against the nt (nonredundant nucleotide) database. This gave more informative results than searching against the protein database.
40. Mouse Genomic-Reads Screening for SNP Identification
Dr. Stuart Fischer, collaborating with Dr. Rudy Leibel, screened 1000 transcripts in 7 genetically defined intervals against the mouse genomic reads database (>41 million reads) and other sources (4 strains total). The screen takes about 15 hours on the BlastMachine.
Multiple sequence alignment built from genomic reads using Paracel GenomeAssembler (PGA).
Strain-specific differences revealed by Calypso (PGA package) and in-house software.
Amino acid substitutions that may cause important structural changes are being predicted using the PrISM software package developed by Dr. An-Suei Yang of the CGC.
46. Legionella Genome Annotation
Dr. James Russo's lab in the CGC is sequencing the genome of L. pneumophila, the bacterium responsible for Legionnaires' disease.
Every few weeks all contigs and unassembled sequences are searched against various NCBI and local databases. Each search runs overnight on the BlastMachine.
PGA is being used as an adjunct to phrap, to take advantage of its ability to use pairwise constraints.
48. Epilepsy Gene Evolutionary Analysis
Dr. Pavel Morozov of the Facility staff collaborated with Drs. Ruth Ottman, Conrad Gilliam and Sergey Kalachikov of the Columbia Genome Center to characterize a new gene family (LGI), one member of which (LGI1) causes a rare form of epilepsy.
The BlastMachine and GeneMatcher2 (Smith-Waterman, HMMER) were used intensively to search for distant homologs.
Comparison of transcribed sequences from genomic regions of about 10 Mb around the LGI family members was performed using the BlastMachine.
50. Other Projects
Bacterial Enzyme Family Screening
Thousands of iterative HMM searches for potential RNA-related enzymes in a bacterial genome using RNA binding domains and known enzyme families. The goal is to discover new family members.
Anopheles gambiae Genome Analysis
Dr. Andrey Rzhetsky conducted an analysis of four gene families and their exon/intron structure in the newly sequenced genome of Anopheles gambiae using the HMMER and GeneWise algorithms on the GeneMatcher2.
51. GeneWays
Parsing literature to reveal gene interaction pathways
Ontology consists of molecular actions (e.g., activate, acetylate, express, methylate), including many subcategories and synonyms
The entire database (currently containing 2.7 million statements, 1.5 million of them unique, and 700,000 substances) is available for search with any term in the ontology
Filtering options reduce magnitude of output
58. How to reach us
Project Website
http://amdec-bioinfo.cu-genome.org
Project Manager: Kenneth Smith
New accounts; new projects
kcs3@columbia.edu
Operational problems: Hans-Eric Aronson
hga1@columbia.edu
Other scientific questions
Pavel Morozov: pm259@columbia.edu
Stuart Fischer: sgf2@columbia.edu
Jim Russo: jjr4@columbia.edu