Similarity Searches on Sequence Databases

Chapter 7, page 215

A story
• H. pylori was discovered in the early 1980s (reported in 1984).
• Its genome was first sequenced in the 1990s (published in Nature in 1997).
• The same publication also listed all the proteins encoded by the genome.
• How did they do this in such a short time?

How?
• They compared the genome sequence of H. pylori with those of other bacteria.
• From the similarities they predicted the proteins of H. pylori and its metabolism.

What does this similarity mean?
• If two protein or gene sequences are similar enough, they are probably homologs (they share a common ancestor).
• So:
• They often come from related organisms.
• Similar proteins usually mean:
• similar functions
• similar structures
• that is, similar characteristics

How similar is similar enough?
• For proteins (rule of thumb):
• If two proteins share >25% identity, they are considered similar.
• Identity below 25% is called the TWILIGHT ZONE: nothing certain can be said about the similarity.
• For nucleotides, the limit is 70% identity (above it, the sequences are considered homologous).
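The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the two sequences are assumed to be already aligned (same length, "-" for gaps), and identity is counted over gap-free columns, which is one of several conventions in use.

```python
# Cut-offs taken from the slide (rules of thumb, not hard limits)
THRESHOLDS = {"protein": 25.0, "nucleotide": 70.0}


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify(aligned_a: str, aligned_b: str, kind: str = "protein"):
    """Return (percent identity, label) using the slide's cut-off for `kind`."""
    identity = percent_identity(aligned_a, aligned_b)
    label = "similar" if identity > THRESHOLDS[kind] else "uncertain"
    return identity, label


if __name__ == "__main__":
    print(classify("MKT-AYI", "MKTGAYL"))              # protein pair
    print(classify("AAAA", "AATT", kind="nucleotide"))  # nucleotide pair
```

Percent identity alone is not proof of homology; the other criteria on the next slide still apply.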

Homology
• In addition to % identity, other information is needed before claiming homology between two sequences:
• Expectation value (E-value): the lower the value, the stronger the evidence for homology
• Length of the similar segments
• Patterns of amino acid conservation
• Number of insertions/deletions
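To see why a lower expectation value means stronger evidence, it helps to look at the usual Karlin-Altschul estimate, E = K · m · n · exp(-λ · S), where S is the raw alignment score, m the query length and n the database length. The sketch below assumes this formula; the function name is hypothetical, and the K and λ values in the usage lines are illustrative placeholders, since the real values depend on the scoring matrix and gap penalties.

```python
import math


def expected_value(raw_score: float, query_len: int, db_len: int,
                   k: float, lam: float) -> float:
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Placeholder statistical parameters (assumptions, not from the slides)
    K, LAM = 0.13, 0.32
    for score in (30, 50, 100):
        e = expected_value(score, query_len=300, db_len=1_000_000, k=K, lam=LAM)
        print(f"score={score:4d}  E={e:.3g}")
```

The E-value falls exponentially as the score rises and grows in proportion to the database size, so the same alignment is less significant in a larger database.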

BLAST (Basic Local Alignment Search Tool)
• 30 years ago, comparing our query with a few hundred other sequences took several hours (print, put on the wall, compare one by one manually :-)
• Now, with fast computers, we can compare our sequence with millions of others in a few minutes at most.

BLASTing Protein Sequences
• 2 strategies
• Compare:
• a protein with a protein database: BLASTP
• a protein with a nucleotide database: TBLASTN (the program translates each nucleotide sequence in the database in all 6 reading frames)

Important BLAST servers
• BLAST server at NCBI (USA)
• BLAST server at the Swiss EMBnet node
• If you learn one, you can use the others.

Which one should we choose?
• It depends on:
• Database: choose the server that offers the database you want.
• Speed: choose a server that is not busy (from Turkey this is no problem during the day until 5 p.m., because it is night in the US and Japan).
• Different BLAST servers can return different results for the same query because their databases differ.

BLAST output contains:
• A graphic display
• A hit list
• The alignments
• The parameters

A graphic display
• Shows which parts of the other sequences are similar to yours.
• This part can look different or be absent on some servers.
• What the colors say: best, good, moderate, worse, worst.
• What the length says: a bar as long as the query suggests a homolog over the full length; a shorter bar corresponds to a shared domain.

A hit list
• Accession number (sp: SWISS-PROT) and name
• Description: lets you judge whether the hit is interesting or not
• Score: if <50, unreliable
• E-value: the lower the E-value, the more significant the similarity; E > 0.001 is the twilight zone; E approaching 0 is best
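The score and E-value cut-offs from this slide can be applied to a parsed hit list as follows. This is a minimal sketch: the `Hit` class, the `triage` function and the example accessions are hypothetical, and the hits are made-up values for illustration, not real BLAST output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    description: str
    score: float
    e_value: float


def triage(hit: Hit, min_score: float = 50.0, max_e: float = 0.001) -> str:
    """Label a hit with the slide's rules: score < 50 or E > 0.001 is doubtful."""
    if hit.score < min_score:
        return "unreliable"
    if hit.e_value > max_e:
        return "twilight zone"
    return "significant"


if __name__ == "__main__":
    # Made-up hits, for illustration only
    hits = [
        Hit("EXAMPLE_1", "hypothetical close homolog", 320.0, 1e-90),
        Hit("EXAMPLE_2", "hypothetical distant match", 62.0, 0.5),
        Hit("EXAMPLE_3", "hypothetical short match", 31.0, 4.0),
    ]
    for hit in sorted(hits, key=lambda h: h.e_value):
        print(f"{hit.accession:10s} E={hit.e_value:<8g} {triage(hit)}")
```

Hits in the twilight zone are not discarded here, only labeled, because a weakly similar sequence can still turn out to be important.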

Alignments
• Alignments show where and how the sequences are similar.
• % identity: >25% is good
• Length: the length of the alignment; short alignments generally give high E-values
• The top line is our query; the bottom line is the hit; (+) marks similar amino acids
• XXXXXX: low-complexity region (masked)
• The numbers show the coordinates

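BLASTing DNA sequences
• If the sequence is a coding region (an open reading frame), translate it to protein and then BLAST the protein.
• If not, choose one of the following:
• a DNA query against a DNA database: BLASTN
• a translated DNA query against a translated DNA database: TBLASTX
• a translated DNA query against a protein database: BLASTX
• T = translated: BLAST translates the DNA sequence into the 6 possible protein sequences (3 reading frames on each strand).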
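The six-frame translation mentioned above can be sketched in plain Python. This sketch assumes the standard genetic code, translates through stop codons (shown as "*") and writes "X" for any codon containing an ambiguous base; the function names are hypothetical and this is not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; trailing bases are ignored."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCATTGTAATGGGCCGCTGA").items():
        print(name, protein)
```

Each of the six translations is then compared as a protein sequence, which is why translated searches take longer than BLASTN or BLASTP.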

Strategies for choosing the right BLAST type for DNA
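The program choices listed on the earlier slides (BLASTP and TBLASTN for protein queries; BLASTN, BLASTX and TBLASTX for DNA queries) can be summarized as a small lookup. The function name and its arguments are hypothetical and only encode those five combinations.

```python
def choose_blast_program(query: str, database: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program from the query type and the database type.

    query, database: "protein" or "dna".
    translate_both: for DNA against DNA, compare at the protein level
    (TBLASTX) instead of at the nucleotide level (BLASTN).
    """
    query, database = query.lower(), database.lower()
    if (query, database) == ("protein", "protein"):
        return "blastp"
    if (query, database) == ("protein", "dna"):
        return "tblastn"   # database is translated in six frames
    if (query, database) == ("dna", "protein"):
        return "blastx"    # query is translated in six frames
    if (query, database) == ("dna", "dna"):
        return "tblastx" if translate_both else "blastn"
    raise ValueError(f"unknown combination: {query!r} vs {database!r}")


if __name__ == "__main__":
    print(choose_blast_program("dna", "protein"))
    print(choose_blast_program("dna", "dna", translate_both=True))
```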

Controlling BLAST: choosing the right parameters

Control sequence masking
• Protein: mask low-complexity regions
• DNA: contains many repeats; use the "human repeats" filter
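The idea behind low-complexity masking can be illustrated with a sliding-window entropy filter. This is a simplified stand-in, not the filter BLAST servers actually run: the function names, the window size of 12 and the entropy threshold of 2.2 bits are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12,
                        min_entropy: float = 2.2, mask_char: str = "X") -> str:
    """Replace every window whose entropy is below min_entropy with mask_char."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    # Made-up sequence: a varied stretch followed by a serine run
    print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 15))
```

Masked residues appear as runs of X in the alignments, so a biased stretch cannot produce high-scoring but meaningless hits.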

BLAST output
• A less homologous sequence can still be important. What can we do? Adjust the parameters:
• Choose a suitable database: to reduce the number of results, use Swiss-Prot.
• Use the "magic" tags of an Entrez query to limit the search.
• Adjust the E-value threshold.

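PSI-BLAST (Position-Specific Iterated BLAST)
• BLAST finds close relatives.
• To find distant relatives, use PSI-BLAST.
• It uses a more complex scoring procedure: the hits of each round are used to build a position-specific scoring matrix, which is then used for the next search.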
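A position-specific scoring matrix can be illustrated with a toy example. The sketch below builds a log-odds profile from a few ungapped aligned sequences, assuming a uniform background and a flat pseudocount; real PSI-BLAST weights sequences, derives pseudocounts from a substitution matrix and repeats the search over several rounds, none of which is shown. All names and the example sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned_seqs, pseudocount: float = 1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, seq: str) -> float:
    """Sum the column-specific scores of seq (same length as the profile)."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    profile = build_pssm(["ACD", "ACE", "ACD"])  # made-up aligned hits
    for candidate in ("ACD", "ACE", "WWW"):
        print(candidate, round(score(profile, candidate), 2))
```

Unlike a single substitution matrix, the profile scores the same residue differently at each position, which is what lets it pick up more distant relatives.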
