BLAST

BLAST Using online BLAST database type from Bio.Blast import NCBIWWW result_handle= NCBIWWW.qblast("blastn", "nt", "8332116") sequence Blast program or from Bio.Blast import NCBIWWW from Bio import SeqIO record = SeqIO.read("m_cold.fasta", format="fasta") result_handle= NCBIWWW.qblast("blastn", "nt", record.seq) result_handle = NCBIWWW.qblast("blastn", "nt", record.format(“fasta”) Just sequence whole seq. information

BLAST Using local BLAST http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download - First we create a command line prompt (for example): blastp–query alphaProtein.faa–db HS –out out.txt–evalue 0.01 - then we use the BLAST Python wrappers from BioPython: evalue threshold Blast program query sequence output file database from Bio.Blast.Applications import NcbiblastpCommandline blastp_cline = NcbiblastpCommandline(cmd ="~/ncbi-blast-2.7.1+/bin/blastp", query="alphaProtein.faa", db="HS", evalue=0.01, out="out.txt") blastp_cline() use Bio.Blast.NCBIStandalone.BlastParser() to parse results in the output file

BLAST Parsing BLAST output from Bio.Blast import NCBIStandalone result_handle = open(”out.txt") blast_parser = NCBIStandalone.BlastParser() blast_record = blast_parser.parse(result_handle) for alignment in blast_record.alignments: for hsp in alignment.hsps: if hsp.expect < 0.001: print (alignment.title) print (alignment.length) print (hsp.expect) print (hsp.query) print (hsp.match) print (hsp.sbjct) list of alignments holds the information about Blast output E-value query sequence alignment matched db seq.

Algorithm evolution Smith Waterman Localalignment algorithm – finds small, locally similar regions (substrings), matrix-based, each cell in the matrix defined the end of a potential alignment. BLAST – start with highest scoring short pairs and extend and down the sequence. Great, but when you’re talking about millions of reads…

Multiple alignment • The linear comparison of more than two sequences • Places residues in columns per position specific similarity scores • reflects relationships of the sequences • the scores are based on indels (gaps) and substitutions. • The alignment of residues implies that they have similar roles in the proteins or DNA sequences being aligned • e.g. protein active sites or transcription factor binding sites • Strength in numbers: the structure/function message from a multiple alignment is stronger than that of a pairwise alignment.

Uses of multiple alignments • Identification of functionally or structurally conserved domain/motif • - biological meaning, domain groups, motif matrices • - ProSite, InterPro, etc • Classification of domains into families • - biological or structural meaning • - Pfam, SCOP • Evolutionary studies • - phylogenetic inference of gene or species evolution • Structure prediction • - homology modeling

Multiple alignment method • Find homologous sequences Homologous sequences share a common ancestor, usually relatively high sequence similarity Not all similar proteins are homologous: Similarity may have come about due to convergent evolution or by chance.

Homology is detectable When there is consensus over a relatively long stretch of sequence OR When the conservation is high within functionally relevant regions THUS Statistical methods based on position-specific matrices help to provide some evidence BUT You usually need to check your alignment by eye to make sure it makes sense AND May need structural data to recognize homology

Finding how many sequences? Use different BLAST algorithms. The more seqs you have the stronger your alignment will be … • Depends on your sequence type and your question • Beware of redundant sequences (choose a threshold relevant to your question) • Beware of pulling in unrelated sequences (take a good look at your dataset) (More sequences means longer computational time, but this is why we have Pegasus)

Global multiple alignment Assumes conserved regions occur in same order Begins by aligning them from the beginning of the sequence Allows gaps Builds a consensus sequence, or a profile if based on statistical calculations Most useful for defining protein families and evolutionary work

Local multiple alignment Assumes conserved regions can be duplicated and can occur in different order along the seqs Block A Block B Block C Block D Most useful for finding motifs (shorter sequence lengths)

Gaps and substitutions For protein msa, PAM, BLOSUM, or other scoring matrices are used for gaps and substitutions – but with position specific weighting. Clustaldefault is BLOSUM68 MUSCLE uses 200PAM plus their own log-expectation matrix PAM is based on number of changes per evolutionary rate – the higher, the less stringent, eg 250 PAM is casting a wide net BLOSUM is based on frequency of changes in closely conserved blocks of motifs – the higher the more stringent, eg BLOSUM80 is biased towards finding motifs that are highly conserved (to 80%), BLOSUM68 less so etc.

Gaps and substitutions PAM, BLOSUM, or other scoring matrices are used for gaps and substitutions – but with position specific weighting. ClustalW default is BLOSUM MUSCLE uses 200PAM plus their own log-expectation matrix For protein sequences, more chance of having indels in the outer loops than inner core or catalytic domain For non-coding DNA, repeats and transposons may occur For structure RNAs, loop regions are more variable than stem regions

Evolution of Algorithms Profiles position specific scoring matrix based on amino acid conservation PSI-BLAST position specific iterative scoring matrix plus BLAST Hidden Markov Models position specific scoring matrix plus position specific gap penalties Structural information? Not trivial…

multiple alignment 1. Exhaustive approaches - mathematically very accurate alignments are optimal BUT these are very complex and take a huge amount of time 2. Heuristic methods - slightly less accurate, alignments are good but not optimal AND are usually enough for biological questions

Multiple alignment method • Find homologous sequences • Place the sequences in a relevant format (usually FASTA), and edit to similar length. • Example • >ACTB cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtagacccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcaggcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacgggccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca • >AGPAT1 tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactgatgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctcacattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagatctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct • You can use a list of accession numbers if you already know that the sequences are of similar lengths.

Multiple alignment method • Find homologous sequences • Place the sequences in a relevant format (usually FASTA), and edit to similar length. • 3. Run a multiple alignment program • ClustalW- oldest, flexible, robust • ClustalΩ- latest version, scalable, more accurate with addition of HMM • MUSCLE - fast, good for finding short motifs in small datasets • PRALINE - includes secondary structure information • T-Coffee - good for small datasets of shorter sequences, has a module for checking input seqs against the PDB • COBALT - uses domain conservation information (from BLAST page) which by definition has some structural information

Clustal family • Clustal X Uses progressive global alignment algorithm Graphic user interface only • Clustal W and W2 Command line tool, W2 also had a web interface Has a parallelized version, to cope with larger datasets • ClustalΩ HMM searches added to algorithm Command line and web interface Scalable to very large datasets

Input of Data • 3 or more sequences are needed, nucleic or amino acids, several formats are accepted: eg FASTA text files • Remove any white space or empty lines • The analysis will fail if two sequences have the same name • Can copy/paste sequences into Clustal or upload a txt file

BLAST

BLAST

Presentation Transcript

BLAST

BLAST

BLAST

BLAST

BLAST

BLAST

BLAST

BLAST

BLAST:

BLAST

BLAST

BLAST

BLAST

BLAST

BLAST

Blast

BLAST

BLAST

BLAST

BLAST

BLAST