1 / 28

Python

Python. What is Biopython ?. Biopython is a python library of resources for developers of Python-base software for bioinformatics and research. can parse bioinformatics files into local data structures Fasta , GenBank , Blast output Clustalw etc.

iren
Télécharger la présentation

Python

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Python

  2. What is Biopython? • Biopython is a python library of resources for developers of Python-base software for bioinformatics and research. • can parse bioinformatics files into local data structures • Fasta, GenBank, Blast output Clustalw etc. • Can access many files directly ( web database, NCBI) from within the script. • Works with sequences and records • Many search algorithms, comparative algorithms and format options.

  3. Installing BioPython • Comes with Anaconda. You don’t even have to type in the import commands! • If you use the standard IDLE environment you will need to download BioPython and place it in the proper directory. • Bioinformatics has become so important in recent years that almost every programming environment, C++, Perl, etc has its own Bioinfo libraries.

  4. Sequence objects • Biological sequences represent the main point of interest in Bioinformatics processing. Python includes a special datatype called a Sequence. • Sequence objects are not the same as Python strings. They are really strings together with additional information, such as an alphabet, and a variety of methods such as translate(), reverse_complement() and so on. • dna = ‘AGTACACTGGT ‘  this is a pure string • // Here is how you create a sequence object. • seqdna = Seq(‘AGTACACTGGT ‘, Alphabet())  sequence obj • Note that seqdna is a sequence object not just a string.

  5. Alphabets - See IUPAC (international union of pure and applied chemistry) • Alphabets are just the set of allowable characters that are used in the string. • IUPAC.unambiguous_dna is really just the set {A,C,G, T} of nucleotides. • IUPAC.unambiguous_rna is {A,C,G,U} • IUPAC.protein is just the 20 standard amino acids {A,R,N,D,C,Q,E,H,I,L,K,M,F,P,S,T,W,Y,V} • and others • We will use mainly the {A,C,G,T} DNA set. • Nice for type checking our sequences.

  6. Dumping Alphabets • from Bio.Alphabet import IUPAC • print IUPAC.unambiguous_dna.letters • print IUPAC.ambiguous_dna.letters • print IUPAC.unambiguous_rna.letters • print IUPAC.protein.letters • OUTPUT • GATC • GATCRYWSMKHBVDN • GAUC • ACDEFGHIKLMNPQRSTVWY

  7. Can work with Sequence objects like strings • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • my_seq= Seq("GATCG", IUPAC.unambiguous_dna) • print my_seq[0]  prints first letter • print len(my_seq)  print length of string in sequence • print Seq(“AAAA”).count(“AA”)  non overlapping count ie 2 • print GC(my_seq)  Gives the GC % of the sequence. • print my_seq[2:5]  We can even slice them. Returns a Seq. • #convert seqobj to a pure string obj • dna_string = str(my_seq)

  8. MutaableSeq • >>>from Bio.Seq import Seq • >>>from Bio.Alphabet import IUPAC • my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) • Observe what happens if you try to edit the sequence: • >>> my_seq[5] = "G" Traceback (most recent call last): ... TypeError: ’Seq’ object does not support item assignment However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it: • >>> mutable_seq= my_seq.tomutable() • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA’, IUPACUnambiguousDNA())

  9. We can modify these • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq[5] = "C" >>> mutable_seqMutableSeq(’GCCATCGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) >>> mutable_seq.remove("T") >>> mutable_seqMutableSeq(’GCCACGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq.reverse() • >>> mutable_seqMutableSeq(’CCCGTGGGAAAGTCGCCGGGTAATGCACCG’, IUPACUnambiguousDNA())

  10. Nucleotide sequences and (reverse) complements • >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC • >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) • >>> my_seq • Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • >>> my_seq.complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA()) • >>> my_seq.reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

  11. Reversing a Sequence • an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step • # FORWARD • >>> my_seqSeq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • #BACKWARD ( Using a -1 step slice ) • >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())

  12. Double Stranded DNA DNA coding strand (aka Crick strand, strand +1) 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ ||||||||||||||||||||||||||||||||||||||| 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ DNA template strand (aka Watson strand, strand −1)

  13. Transcription • 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ • ||||||||||||||||||||||||||||||||||||||| • 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ • Transcription • 5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ Single stranded messenger RNA

  14. Lets do some Reverse Comp • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • template_dna= coding_dna.reverse_complement() • print template_dna • CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT

  15. Transcribe ( T->U ) • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • messenger_rna = coding_dna.transcribe() • print messenger_rna • AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG • //or you can do both • messenger_rna = coding_dna.reverse_complement().transcribe()

  16. Translate into protein • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) • print messenger_rna • print messenger_rna.translate() # I added the spaces • AUG GCC AUU GUA AUG GGC CGC UGA AAG GGU GCC CGA UAG • MAIVMGR*KGAR* • # the * represents stop codons.

  17. StandardTranslation Table

  18. Printing Tables • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • from Bio.Data import CodonTable • stdTable = CodonTable.unambiguous_dna_by_id[1] • print stdTable • mitoTable= CodonTable.unambiguous_dna_by_id[2] • print mitoTable

  19. Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-- Table 2 Vertebrate Mitochondrial, SGC1 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA W | A T | TTG L | TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L | CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I(s)| ACT T | AAT N | AGT S | T A | ATC I(s)| ACC T | AAC N | AGC S | C A | ATA M(s)| ACA T | AAA K | AGA Stop| A A | ATG M(s)| ACG T | AAG K | AGG Stop| G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V(s)| GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+--

  20. Codon - Amino Acids

  21. The SeqRecord Object • A SeqRecord is a structure that allows the storage of additional information with a sequence. This includes the usual information found in standard genbank files. The following is a sampleof the fields in SeqRecord. • .seq - The sequence • .id - The primary ID used to identify the sequence (String) • .name – The common name of the sequence • .annotations – A dictionary of additional information about the sequence • .features –A list of SeqFeatureobjects • and others.

  22. Build From Scratch • >>> from Bio.Seq import Seq • >>> simple_seq = Seq("GATC") • >>> from Bio.SeqRecord import SeqRecord • >>> simple_seq_r= SeqRecord(simple_seq) • or pass in the id, description etc. • >>> simple_seq_r.id = "AC12345" • >>> simple_seq_r.description = "Made up sequence I wish I could write a paper about" • >>> print(simple_seq_r.description) • Made up sequence I wish I could write a paper about • >>> simple_seq_r.seq • Seq(’GATC’, Alphabet())

  23. Fill SeqRecord from Fasta file • >gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC • ------------------------------------------------------------------------------------------ • >>> from Bio import SeqIO • # Note that SeqIO.read will only read one record. • >>> record = SeqIO.read("NC_005816.fna", "fasta") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()), id=’gi|45478711|ref|NC_005816.1|’, name=’gi|45478711|ref|NC_005816.1|’, description=’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... sequence’, dbxrefs=[])

  24. ACCEss the fields Individually • >>> record.seqSeq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()) • >>> record.id • ’gi|45478711|ref|NC_005816.1|’ • >>> record.description • ’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence’

  25. These are missing • >>> record.dbxrefs [] • >>> record.annotations {} • >>> record.letter_annotations {} • >>> record.features [] • Note which is a dict and which is a list!!

  26. Reading Genbank Files • LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008 • DEFINITION Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence. • ACCESSION NC_005816 • VERSION NC_005816.1 GI:45478711 • PROJECT GenomeProject:10638

  27. Read the File • >>> from Bio import SeqIO • >>> record = SeqIO.read("NC_005816.gb", "genbank") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, IUPACAmbiguousDNA()), id=’NC_005816.1’, name=’NC_005816’, description=’Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence.’, dbxrefs=[’Project:10638’])

  28. Read a record • from Bio import SeqIO • record = SeqIO.read("micoplasmaGen.gb","genbank") • print record.description • ct=0 • for f in record.features: • if f.type=='gene': • ct+=1 • print ct

More Related