Bioinformatics Overview: Tools and Techniques for Biological Data Analysis
Explore bioinformatics, the storage, retrieval, and analysis of biological information like DNA sequences and protein structures. Discover databases, annotation, gene ontology, and more in this comprehensive field.
Bioinformatics Overview School of B&I TCD May 2010
Who, me? • Andrew Lloyd • atlloyd@tcd.ie • 087-225-9850, 053-9255717, 01-896-2450 • Director INCBI 1993-2000 • Population genetics, evolution • Whole genome analysis • Immunology, chickens, FIRM
Definition/scope • Storage, retrieval and analysis of biological (sequence) information. • Insert better definition here • Case can be made for microarray analysis • NOT: ecoinformatics (ecology), image analysis, bar-coding hospital sheets
Philosophy “Nothing worth learning can be taught” Oscar Wilde
Getting bioinformation • Type it in: A,T,C,C,G,T,C,A (1991) • Access databases • Literature (Pubmed) • Medical (OMIM) • DNA sequence (EMBL/GenBank) • Protein sequence (UniProt, SwissProt, PIR) • 3-D structure (PDB)
Annotation • In any DB, half is data and half context. • Gene ontology (language) • Parsing sequence (ORF, RBS, intron, α-helix) • Recognising similar sequences (evolution!) • Complementary info: DB cross-referencing • (DNA -> Protein -> 3D structure -> motifs)
Secondary databases • Protein motifs, domains, families • RNA structures (16S ribosomal RNA…) • Taxonomy/classification • Metabolic pathways (KEGG) • Enzymes (Brenda, TCD, Ireland) • SNPs: mutations and variants • Disease DBs (OMIM) • Immuno, epitope DBs
Complete genomes • Ensembl (complex, basically vertebrate) • Uniform look-and-feel; cross-refs • UCSC GoldenPath browser • Plants • Bacterial genomes • Including mitochondrial, chloroplast • Eubacteria vs Archaea vs Eukaryotes
Annotated/known genes • What does my gene do? • Blast (fasta) against the DB • SRS/Entrez to access databases • Neighboring (similar things in same DB) • DB cross-references • full picture of attributes • What biochemical pathway?
The territory (diagram of linked databases) • PubMed • Full-text journals • OMIM • GenBank/EMBL: DNA sequence • UniProt: protein sequence • Maps & genomes • Prosite, Pfam, PSSM • PDB: 3-D structure • Taxonomy
Databases • BIG • EMBL/GenBank: 200 Gbp, 100m entries, 2,500 complete genomes, 200K species • Encyclopaedia Britannica: 180m letters, 40m words • EMBL printed out = roughly 1 km of Britannica volumes • Doubling every 14-18 months • Human genome is X bp?
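The doubling claim above can be turned into rough arithmetic. This is only a sketch: the function name `projected_size` and the 5-year horizon are illustrative assumptions, and it assumes the 14-18 month doubling time stays constant, which the slide does not claim.

```python
def projected_size(current_bp, months_ahead, doubling_months):
    """Size after months_ahead if the database doubles every doubling_months."""
    return current_bp * 2 ** (months_ahead / doubling_months)

current = 200e9  # 200 Gbp, the EMBL/GenBank figure quoted on the slide

# Compare the fast and slow ends of the quoted doubling range.
for doubling in (14, 18):
    size = projected_size(current, 60, doubling)
    print(f"doubling every {doubling} months: about {size / 1e12:.1f} Tbp after 5 years")
```

The two ends of the range give noticeably different projections, which is a reminder that exponential estimates are sensitive to the assumed doubling time.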
Intrinsic vs Context: Internal • DNA, protein sequence • DNA: Purine/Pyrimidine • AAs: small, hydrophobic, aromatic, polar • Variants: SNPs, Indels, Alt Splicing • 2ndry structure • DNA: stem/loops • Protein: helix, sheet, turn, loop
Intrinsic vs Context: External, context for your molecule • In other species (homologs, phylog trees) • In which cell • In which cellular location (GO) • Molecular complex (dimers) • Which pathway (KEGG) • Where in genome (neighbors, synteny)
New Unknown Gene • Blast homology searching • Genomic location/neighboring genes • Where is it expressed? • How regulated (control sequences) • Intron/exon structure • Domain structure • Restriction sites etc. • Primer design
DNA/gene structure • Four bases: A, T, C, G (U replaces T in RNA) • 2 pyrimidine (C, T), 2 purine (A, G) • LOTS of them: how many? • Open reading frame • 5’ signals, 3’ signals • Introns/exons • Neighbours (operons)
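As a small illustration of the purine/pyrimidine split, the sketch below counts bases in a DNA string. The function name `base_composition` and the returned dictionary layout are assumptions made for this example, not part of any standard tool.

```python
from collections import Counter

PURINES = set("AG")
PYRIMIDINES = set("CT")  # in RNA, U would take the place of T

def base_composition(seq):
    """Count bases and summarise purine/pyrimidine and G+C content."""
    seq = seq.upper()
    counts = Counter(seq)
    purines = sum(counts[b] for b in PURINES)
    pyrimidines = sum(counts[b] for b in PYRIMIDINES)
    gc = (counts["G"] + counts["C"]) / len(seq) if seq else 0.0
    return {
        "counts": dict(counts),
        "purines": purines,
        "pyrimidines": pyrimidines,
        "gc_fraction": gc,
    }

print(base_composition("ATGCCCAGGAGATTTGGA"))
```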
Two sequences • Alignment • Local • Global • Dotplot • Threading
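The difference between global and local alignment can be sketched with one dynamic-programming routine. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty (the default values are arbitrary choices, and real tools use substitution matrices and affine gaps). It returns only the best score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end gaps are penalised.
    local=True:  local alignment (Smith-Waterman style), cell scores never drop below 0.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps cost something.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

print(alignment_score("GATTACA", "GATTACA"))
print(alignment_score("TTTACGTTTT", "GGACGGG", local=True))
```

Local mode finds the shared "ACG" core and ignores the unrelated flanks, which is the behaviour a database search needs.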
One seq vs many • Homology search vs database • Special case of 2-seq alignment • Blast vs fasta • Limit by species/taxon • Substitution matrices • Low complexity masking
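Low complexity masking can be illustrated with a sliding-window Shannon entropy filter. This is a simplified stand-in, not the algorithm that BLAST-style filters actually use; the window size, threshold, and function names are assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the character distribution within a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every position covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)

print(mask_low_complexity("ACGTACGTACGT" + "A" * 12 + "ACGTACGTACGT"))
```

Masking the poly-A run stops it from producing spurious hits against every other A-rich sequence in the database.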
Multiple sequence alignment • MSA • Progressive alignment • ClustalW or (better) T-Coffee
Phylogenetic trees • Computationally intensive • Distance matrix methods • Neighbor-joining (NJ) • UPGMA • Minimum evolution • Maximum parsimony • Maximum likelihood • Bayesian methods
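Of the distance matrix methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to everything else, weighted by cluster size. The code below is a minimal illustration that returns only the tree topology as nested tuples (no branch lengths); the function name, input layout, and toy distances are assumptions for this example.

```python
def upgma(names, matrix):
    """Build a UPGMA tree topology.

    names:  list of leaf labels
    matrix: symmetric list-of-lists of pairwise distances, same order as names
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster:
        # average of the two old distances, weighted by leaf counts.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

toy = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(["A", "B", "C", "D"], toy))
```

UPGMA assumes a constant rate of change along all lineages, which is one reason neighbor-joining and the likelihood-based methods are usually preferred for real data.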
Genefinding • Special case of DNA analysis • How to annotate a genome • Bacterial • Find open reading frames (ORFs) • With start/stop codons • With promoter, RBS, CAAT, TATA • Eukaryotic • As above PLUS • Introns/exons • Alternative splicing
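The first step for a bacterial genome, finding open reading frames, can be sketched as a scan of the three forward reading frames for an ATG followed in-frame by a stop codon. This is a deliberately naive illustration: it ignores the reverse strand, alternative start codons, promoters and ribosome binding sites, and the function name and `min_codons` parameter are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```

For eukaryotic genes this approach breaks down because introns interrupt the reading frame, which is why eukaryotic genefinders model splice sites as well.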
Typical mammalian gene structure (diagram) • DNA, 5’ to 3’: control region, start (ATG), Exons 1-4 separated by introns (gt.. …ag), stop (TAG, TGA, TAA) • miRNAs? • RNA: introns “spliced out” and discarded • DNA: ATGCCCAGGAGATTTGGA . . . • Protein: MetProArgArgPheGly . . .
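The DNA-to-protein example on this slide can be reproduced with a few lines of code. The sketch below builds the standard genetic code from the usual compact TCAG ordering and translates a coding sequence in frame 1 until the first stop codon. The function name `translate` and the use of "X" for unrecognised codons are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG; "*" marks a stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a coding DNA sequence (one-letter amino acids), stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGCCCAGGAGATTTGGA"))  # MPRRFG, i.e. Met Pro Arg Arg Phe Gly
```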
Protein substructure • DNA makes protein and protein (enzymes) make everything else. • 20 Amino acids • Amino acid properties • Motifs • Domains • Biological units
Protein 3-D structure • Relationship between sequence & structure • Secondary structure • Alpha helix • Beta sheet • Coil • Turn • Threading sequence to homologous structure
Gene Expression • EST • SAGE • MicroArray • Clustering of same expressed genes
Genomics • Complete DNA seq for a species • Gene order • Gene clusters/operons • Missing operons • Gene duplication • Whole genome duplication (WGD)
SNPs • Key issue in genetics is that two organisms are both the same and different: • Humans vs chimps vs mouse • Parent vs offspring vs co-national vs human • Single nucleotide polymorphisms • Variation between individuals • Pharmacogenetics • Personal tailored medicine
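Once two sequences are aligned, spotting single nucleotide differences is a column-by-column comparison. The sketch below assumes the two inputs are already aligned and of equal length, skips gap columns (which would be indels rather than SNPs), and uses made-up example sequences; the function name `find_snps` is an assumption for this illustration.

```python
def find_snps(ref, alt):
    """List (position, ref_base, alt_base) for single-base differences.

    Positions are 1-based. Inputs must be pre-aligned and the same length;
    columns containing a gap character "-" are ignored.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, r, a)
            for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
            if r != a and "-" not in (r, a)]

print(find_snps("ATGCCCAGG", "ATGCTCAGA"))
```

Note that a difference between two individuals only counts as a polymorphism in the population-genetic sense if the variant is found at appreciable frequency, which this pairwise comparison cannot tell you.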
Summary/take home • Course designed to give you access to databases, software tools • …and ways of thinking about data