Exercise: BIOINFORMATIC DATABASES and BLAST
Outline • NCBI and Entrez • Pubmed • Google scholar • RefSeq • Swissprot • Fasta format • PDB: Protein Data Bank • Organism specific databases • Summary • Pairwise Sequence Alignment and BLAST • Overview • Query type: DNA or Protein
What’s in a database? • Sequences – genes, proteins, etc. • Full genomes • Annotation – information about genes/proteins: function, cellular location, chromosomal location, introns/exons, protein structure, phenotypes, diseases • Publications
NCBI and Entrez • NCBI: National Center for Biotechnology Information • One of the largest and most comprehensive collections of databases; it belongs to the NIH (National Institutes of Health), the primary Federal agency for conducting and supporting medical research in the USA • Entrez is the search engine of NCBI • Search for: genes, proteins, genomes, structures, diseases, publications and more • http://www.ncbi.nlm.nih.gov/
PubMed: search for published papers • Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
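The fielded query above can also be assembled programmatically. The following is a minimal sketch, not an official client: the helper names `build_pubmed_query` and `esearch_url` are hypothetical, and the E-utilities ESearch address is an assumption that does not appear in the slides. The code only builds strings and makes no network request.

```python
from urllib.parse import urlencode

def build_pubmed_query(terms):
    """Join (value, field_tag) pairs with AND, e.g. ("Yang", "AU") -> Yang[AU]."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)

def esearch_url(query, db="pubmed"):
    """Build a search URL (assumed endpoint: NCBI E-utilities ESearch)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return f"{base}?{urlencode({'db': db, 'term': query})}"

# Field tags as used on the slide: AU = author, TI = title,
# DP = date of publication, TA = journal title abbreviation.
query = build_pubmed_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
print(esearch_url(query))
```

Keeping the terms as (value, tag) pairs makes it easy to add or drop a field without retyping the whole query.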
Exercise • Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R
Using limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years
Google scholar http://scholar.google.com/
NCBI gene & protein databases: GenBank • GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations) • Holds 99 billion bases (2008)
Searching NCBI for the protein human CD4 Search demonstration
Using field descriptions, qualifiers, and boolean operators • Cd4[GENE] AND human[ORGN], or equivalently: Cd4[gene name] AND human[organism] • List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers • Boolean operators: AND, OR, NOT • Note: do not use the field Protein name [PROT], only GENE!
RefSeq • RefSeq: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
Swissprot • A protein sequence database which strives to provide a high level of annotation: the function of a protein, domain structure, post-translational modifications, variants • One entry for each protein
GenBank vs. Swissprot [Figure: search results in Swiss-Prot compared with search results in GenBank]
Fasta format • Header line: ">" followed by the ID/accession and a description • Following lines: the sequence
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_000607.1
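To make the header/sequence split concrete, here is a small sketch of a FASTA reader. The function names `parse_fasta` and `split_header` are hypothetical, and the example record uses the CD4 header from the slide with the sequence deliberately truncated to its first 40 residues to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_header(header):
    """Split a header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description

# Header from the slide; sequence truncated to 40 residues for illustration.
example = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLP
AATQGKKVVLGKKGDTVELT
"""

header, sequence = parse_fasta(example)[0]
identifier, description = split_header(header)
# In this pipe-delimited header style the accession is the fourth field.
accession = identifier.split("|")[3]
print(accession, description, len(sequence))
```

Sequence lines are joined, so the parser accepts both single-line and wrapped sequences.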
PDB: Protein Data Bank • Main database of 3D structures • Includes ~56,000 entries (proteins, nucleic acids, others) • Proteins organized in groups, families etc • Is highly redundant • different conformations (e.g., ligand dependent) • http://www.rcsb.org
Human CD4 in complex with HIV gp120 (PDB ID 1G9M) [Figure: 3D structure of the complex, with gp120 and CD4 labeled]
Organism specific databases • Model organisms have independent databases: HIV database http://hiv-web.lanl.gov/content/index http://gmod.org/wiki/Main_Page?q=node/71
Summary • General and comprehensive databases: NCBI, EMBL, DDBJ • Genome specific databases: ENSEMBL, UCSC genome browser • Highly annotated databases: proteins (Swissprot, RefSeq) and structures (PDB)
And always remember: • Google (or any search engine) • RTFM -Read the manual!!! (/help/FAQ)
What is sequence alignment? Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences. In the example, "|" marks identical residues:
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
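The row of "|" characters in an alignment can be derived mechanically from the two aligned sequences. The sketch below marks identical residues only (similar but non-identical residues get no mark, which is a simplifying assumption); the function names are hypothetical.

```python
def match_line(a, b):
    """Return a string with '|' where two aligned sequences have the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # A gap character '-' never counts as a match.
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))

def percent_identity(a, b):
    """Percentage of alignment columns that hold identical residues."""
    return 100.0 * match_line(a, b).count("|") / len(a)

seq1 = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq2 = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

print(seq1)
print(match_line(seq1, seq2))
print(seq2)
print(f"{percent_identity(seq1, seq2):.1f}% identity")
```

For the two sequences on the slide this marks 18 identical positions out of 27.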
Local vs. Global • Global alignment – finds the best alignment across the whole two sequences. • Local alignment – finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
Evolutionary changes in sequences • In the course of evolution, the sequences changed from the ancestral sequence by random mutations • Three types of mutations: • Insertion: AAGA -> AAGTA • Deletion: AAGA -> AGA • Substitution: AAGA -> AACA • Insertion + Deletion = Indel
Scoring scheme • Match/mismatch scores: substitution matrices • Nucleic acids: transition-transversion • Amino acids: evolution (empirical data) based (PAM, BLOSUM) or physico-chemical properties based (Grantham, McLachlan) • Gap penalty
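As an illustration of how a scoring scheme turns an alignment into a number, the sketch below scores a given nucleotide alignment with a transition-transversion scheme and a linear gap penalty. The numeric values (match +1, transition -1, transversion -2, gap -2) are arbitrary assumptions chosen for the example, not values from the slides, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides."""
    if x == y:
        return match
    # A transition stays within purines or within pyrimidines;
    # a transversion switches between the two classes.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion

def score_alignment(a, b, gap=-2):
    """Sum the column scores of two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += substitution_score(x, y)
    return total

print(score_alignment("AAGA", "AACA"))    # substitution G -> C (transversion)
print(score_alignment("AAG-A", "AAGTA"))  # one inserted T, scored as a gap
```

With amino acids, `substitution_score` would be replaced by a lookup in a matrix such as PAM or BLOSUM.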
Computation time: how do we search a database? If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds, or about 11.6 days, to complete one search. At least 150,000 searches are performed per day. >82,000,000 sequence records in GenBank.
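The running-time estimate can be checked with a few lines of arithmetic, using only the figures given on the slide (ten alignments per second, ten million database sequences).

```python
alignments_per_second = 10   # each pairwise alignment takes 1/10 of a second
database_size = 10**7        # number of sequences in the database
seconds_per_day = 24 * 60 * 60

total_seconds = database_size / alignments_per_second
days = total_seconds / seconds_per_day

print(f"{total_seconds:.0f} seconds = {days:.1f} days for one search")
```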
Conclusion Using the exact comparison pairwise alignment algorithm between the query and all DB entries – too slow
Heuristic Definition: a heuristic is a method for solving a problem that does not guarantee the exact solution (though the result is usually good enough) but takes much less time than the exact algorithm
BLAST BLAST - Basic Local Alignment Search Tool • A heuristic for searching a database for similar sequences • The heuristic is based on restricting the similarity search (such as using ungapped word matching instead of single-character matching).
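The idea of word matching can be sketched as follows: index every word of length k in the query, then scan a database sequence for the same words. This is a deliberately simplified illustration, not the actual BLAST implementation; it finds exact word hits only and omits the extension and scoring of hits. The word length k=3 and the function names are assumptions made for the example.

```python
from collections import defaultdict

def index_words(seq, k):
    """Map each length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where both share an exact k-letter word."""
    index = index_words(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits

query = "ADLGAVFALCDRYFQ"
subject = "ADLGRTQNCDRYYQ"
print(find_word_hits(query, subject))
```

Only sequences with at least one word hit need to be examined further, which is where the time saving over aligning the query against every database entry comes from.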
Query type: DNA or Protein • All types of searches are possible (query: DNA or protein; database: DNA or protein) • blastn – nucleotide query vs. nucleotide database • blastp – protein query vs. protein database • blastx – translated nucleotide query vs. protein database • tblastn – protein query vs. translated nucleotide database • tblastx – translated nucleotide query vs. translated nucleotide database
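The program list above amounts to a small lookup table keyed by query type, database type, and whether a translation is involved. The sketch below encodes it that way; the function name `choose_program` and the `translated` flag are hypothetical conveniences for the example.

```python
# (query type, database type, translation involved) -> BLAST program
BLAST_PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("dna", "protein", True): "blastx",    # query is translated
    ("protein", "dna", True): "tblastn",   # database is translated
    ("dna", "dna", True): "tblastx",       # both are translated
}

def choose_program(query_type, db_type, translated=False):
    """Return the BLAST program for a query/database combination."""
    key = (query_type.lower(), db_type.lower(), translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]

print(choose_program("DNA", "protein", translated=True))
print(choose_program("protein", "DNA", translated=True))
```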
Query type • Information content in the letters: • Nucleotides: 4 letter alphabet • Amino acids: 20 letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity • Selection (and hence conservation) works (mostly) at the protein level • Conclusion: the amino-acid sequence is often preferable for homology search
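The 25% and 5% figures follow from a short calculation: the chance that two independently drawn letters are identical is the sum of the squared letter frequencies. The sketch below assumes all letters are equally frequent, which is a simplification (real nucleotide and amino-acid frequencies are not uniform); the function name is hypothetical.

```python
def expected_identity(frequencies):
    """Probability that two independent random letters are the same."""
    return sum(p * p for p in frequencies)

dna_frequencies = [1 / 4] * 4        # A, C, G, T equally likely (assumption)
protein_frequencies = [1 / 20] * 20  # 20 amino acids equally likely (assumption)

print(f"DNA:     {expected_identity(dna_frequencies):.0%}")
print(f"Protein: {expected_identity(protein_frequencies):.0%}")
```

A chance match is therefore five times less likely per position in proteins, so a given level of identity is stronger evidence of homology.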
E-value • The E-value is the number of alignments with a score ≥ Y that we expect to find by chance when searching a random sequence against a random database • Theoretically, we could trust any result with an E-value ≤ 1 • In practice, BLAST uses estimations: • E-values of 10^-4 and lower indicate a significant homology. • E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous). • E-values between 10^-2 and 1 do not indicate a good homology
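The rule-of-thumb thresholds above can be written as a small function. This is only a sketch of the guidance on the slide: how the exact boundary values are assigned and the wording of the returned labels are assumptions, and the function name is hypothetical.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the rule-of-thumb thresholds from the slide."""
    if evalue < 0:
        raise ValueError("an E-value cannot be negative")
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if evalue <= 1:
        return "does not indicate a good homology"
    return "not trustworthy"

for e in (1e-30, 5e-3, 0.5, 10):
    print(f"{e:g}: {interpret_evalue(e)}")
```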
Filtering low complexity • Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA) • Regions of low complexity generate high alignment scores, BUT this does not indicate homology
BLAST 2 sequences at NCBI • Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment • Does not use an optimal algorithm but a heuristic
Bl2Seq - query • blastn – nucleotide • blastp – protein
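One simple way to see what "low complexity" means is to measure the Shannon entropy of a sliding window: a window dominated by one letter has entropy near zero. The sketch below is an illustration only and is not the filter BLAST itself applies; the window size of 8, the threshold of 1.0 bit, and the function names are all assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the letter distribution inside a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def low_complexity_windows(seq, size=8, threshold=1.0):
    """Start positions of windows whose entropy falls below the threshold."""
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) < threshold
    ]

# A made-up peptide with a proline-rich stretch in the middle.
peptide = "MKTAPPPPPPPPLQ"
print(low_complexity_windows(peptide))
```

Positions flagged this way could be masked before the search so that they do not produce high-scoring but meaningless hits.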
Bl2seq results [Figure: annotated alignment output, with labels for match, similarity, dissimilarity, gaps, and low complexity]
BLAST – programs [Figure: query type (DNA or protein) against database type (DNA or protein)]