1 / 136

Programming for Engineers in Python

Programming for Engineers in Python. Biopython. Classes. class < classname > : statement_1 . . statement_n The methods of a class get the instance as the first parameter traditionally named self The method __init__ is called upon object construction (if available). Classes.

aderes
Télécharger la présentation

Programming for Engineers in Python

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Programming for Engineers in Python Biopython

  2. Classes class<classname>: statement_1 . . statement_n • The methods of a class get the instance as the first parameter • traditionally named self • The method __init__ is called upon object construction (if available)

  3. Classes • Reminder: type = data representation + behavior. • Classes are user-defined types. class<classname>: statement_1 . . statement_n • Objects of a class are called class instances. • Like a mini-program: • Variables. • Function Definitions. • Even arbitrary commands.

  4. Classes – Attributes and Methods classVector2D: def__init__ (self, x, y):self.x, self.y= x, y defsize (self):return (self.x** 2 +self.y** 2) **0.5 Instance Attributes(each instance has its own copy) Methods

  5. Classes – Instantiate and Use >>>v = Vector2D(3, 4) # Make instance.>>> v<__main__.Vector2D object at 0x00000000031A2828>>>>v.size() # Call method on instance.5.0

  6. Example – Multimap • A dictionary with more than one value for each key • We already needed it once or twice and used: >>>lst=d.get(key, [ ])>>>lst.append(value)>>> d[key] =lst • We will now write a new class that will be a wrapperaround a dict • The class will have methods that allow us to keep multiple values for each key

  7. Multimap. partial code classMultimap:def__init__(self):'''Create an empty Multimap'''self.inner= innerdefget(self, key):'''Return list of values associated with key'''returnself.inner.get(key, [])defput(self, key, value):'''Adds value to the list of values associated with key'''value_list=self.get(key)if value notinvalue_list:value_list.append(value)self.inner[key] =value_list

  8. Multimapput_all and remove defput_all(self, key, values):for v in values: self.put(key, v) defremove(self, key, value):value_list=self.get(key)if value invalue_list:value_list.remove(value)self.inner[key] =value_listreturnTruereturnFalse

  9. Multimap. Partial code def__len__(self):'''Returns the number of keys in the map'''returnlen(self.inner)def__str__(self):'''Converts the map to a string'''returnstr(self.inner)def__cmp__(self, other):'''Compares the map with another map'''returnself.inner.cmp(other)def__contains__(self, key):'''Returns True if key exists in the map'''returnself.has_key(k)

  10. Multimap • Use case – a dictionary of countries and their cities: >>> m =Multimap() >>>m.put('Israel', 'Tel-Aviv') >>>m.put('Israel', 'Jerusalem') >>>m.put('France', 'Paris') >>>m.put_all('England',('London', 'Manchester', 'Moscow')) >>>m.remove('England', 'Moscow') >>> print m.get('Israel') ['Tel-Aviv', 'Jerusalem']

  11. Biopython • An international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology • Provides tools for • Parsing files (fasta, clustalw, GenBank,…) • Interface to common softwares • Operations on sequences • Simple machine learning applications • BLAST • And many more

  12. Installing Biopython • Go to http://biopython.org/wiki/Download • Windows • Unix • Select python 2.7 • NumPy is required

  13. SeqIO • The standard Sequence Input/Output interface for BioPython • Provides a simple uniform interface to input and output assorted sequence file formats • Deals with sequences as SeqRecord objects • There is a sister interface Bio.AlignIO for working directly with sequence alignment files as Alignment objects

  14. Parsing a FASTA file # Parse a simple fasta filefrom Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.fasta", "fasta"):print seq_record.idprintrepr(seq_record.seq)printlen(seq_record) Why repr and not str?

  15. GenBank files # genbank filesfrom Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.gbk", "genbank"):printseq_record# added to print just one record examplebreak

  16. GenBank files from Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.gbk", "genbank"):print seq_record.idprintrepr(seq_record.seq)printlen(seq_record)

  17. Sequence objects • Support similar methods as standard strings • Provide additional methods • Translate • Reverse complement • Support different alphabets • AGTAGTTAAA can be • DNA • Protein

  18. Sequences and alphabets • Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions • For example: • Adding ambiguous symbols • Adding special new characters

  19. Example – generic alphabet Non-specific alphabet

  20. Example – specific sequences >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("AGTACACTGGT", IUPAC.unambiguous_dna)>>>my_seqSeq('AGTACACTGGT', IUPACUnambiguousDNA())>>>my_seq.alphabetIUPACUnambiguousDNA()>>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_prot=Seq("AGTACACTGGT", IUPAC.protein)>>>my_protSeq('AGTACACTGGT', IUPACProtein())>>>my_prot.alphabetIUPACProtein()    

  21. Sequences act like strings • Access elements • Count without overlaps >>>printmy_seq[0] #first letterG>>>printmy_seq[2] #third letterT>>>printmy_seq[-1] #last letterG >>>fromBio.SeqimportSeq>>>"AAAA".count("AA")2>>>Seq("AAAA").count("AA")2

  22. Calculate GC content >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>fromBio.SeqUtilsimport GC>>>my_seq=Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)>>> GC(my_seq)46.875

  23. Slicing • Simple slicing • Start, stop, stride >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)>>>my_seq[4:12]Seq('GATGGGCC', IUPACUnambiguousDNA()) >>>my_seq[0::3]Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())>>>my_seq[1::3]Seq('AGGCATGCATC', IUPACUnambiguousDNA())>>>my_seq[2::3]Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

  24. Concatenation • Simple addition as in Python • But, alphabets must fit >>>fromBio.Alphabetimport IUPAC>>>fromBio.SeqimportSeq>>>protein_seq=Seq("EVRNAK", IUPAC.protein)>>>dna_seq=Seq("ACGT", IUPAC.unambiguous_dna)>>>protein_seq+dna_seqTraceback (most recent call last): …

  25. Changing case >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimportgeneric_dna>>>dna_seq=Seq("acgtACGT", generic_dna)>>>dna_seqSeq('acgtACGT', DNAAlphabet())>>>dna_seq.upper()Seq('ACGTACGT', DNAAlphabet())>>>dna_seq.lower()Seq('acgtacgt', DNAAlphabet())

  26. Changing case • Case is important for matching • IUPAC names are upper case >>>"GTAC"indna_seqFalse>>>"GTAC"indna_seq.upper()True >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>dna_seq=Seq("ACGT", IUPAC.unambiguous_dna)>>>dna_seqSeq('ACGT', IUPACUnambiguousDNA())>>>dna_seq.lower()Seq('acgt', DNAAlphabet())

  27. Reverse complement >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)>>>my_seq.complement()Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())>>>my_seq.reverse_complement()Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

  28. Transcription >>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)>>>template_dna=coding_dna.reverse_complement()>>>messenger_rna=coding_dna.transcribe()>>>messenger_rnaSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) As you can see, all this does is switch T → U, and adjust the alphabet.

  29. Translation Simple example >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>messenger_rna=Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)>>>messenger_rnaSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())>>>messenger_rna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) Stop codon!

  30. Translation from the DNA >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)>>>coding_dnaSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())>>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

  31. Using different translation tables • In several cases we may want to use different translation tables • Translation tables are given IDs in GenBank (standard=1) • Vertebrate Mitochondrial is table 2 • More details in http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

  32. Using different translation tables • >>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table="Vertebrate Mitochondrial")Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table=2)Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

  33. Translation tables in biopython

  34. Translate up to the first stop in frame >>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(to_stop=True)Seq('MAIVMGR', IUPACProtein())>>>coding_dna.translate(table=2)Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table=2, to_stop=True)Seq('MAIVMGRWKGAR', IUPACProtein())

  35. Comparing sequences Standard “==“ comparison is done by comparing the references (!), hence: >>> seq1 =Seq("ACGT", IUPAC.unambiguous_dna)>>> seq2 =Seq("ACGT", IUPAC.unambiguous_dna)>>> seq1==seq2 • Warning (from warnings module): • … FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq1)==str(seq2)to make your code explicit and to avoid this warning.False>>> seq1==seq1True

  36. Mutable vs. Immutable • Like strings standard seq objects are immutable • If you want to create a mutable object you need to write it by either: • Use the “tomutable()” method • Use the mutable constructor mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

  37. Unknown sequences example • In many biological cases we deal with unknown sequences >>>fromBio.SeqimportUnknownSeq>>>fromBio.Alphabetimport IUPAC>>>unk_dna=UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)>>>my_seq=Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)>>>unk_dna+my_seqSeq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

  38. MSA

  39. Read MSA • Use Bio.AlignIO.read(file, format) • File – the file path • Format support: • “stockholm” • “fasta” • “clustal” • … • Use help(AlignIO) for details

  40. Example • We want to parse this file from PFAM

  41. Example from Bio importAlignIOalignment =AlignIO.read("PF05371.sth", "stockholm")print alignment

  42. Alignment object example >>>from Bio importAlignIO>>> alignment =AlignIO.read("PF05371_seed.sth", "stockholm")>>>print alignment[1]ID: Q9T0Q8_BPIKE/1-52Name: Q9T0Q8_BPIKEDescription: Q9T0Q8_BPIKE/1-52Number of features: 0/start=1/end=52/accession=Q9T0Q8.1Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())

  43. Alignment object example >>>print"Alignment length %i"%alignment.get_alignment_length()Alignment length 52>>>for record in alignment:print"%s - %s"% (record.seq, record.id)AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73       

  44. Cross-references example Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?>>>for record in alignment:ifrecord.dbxrefs:print record.id, record.dbxrefsCOATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

  45. Comments • Remember that almost all MSA formats are supported • When you have more than one MSA in your files use AlignIO.parse() • Common example is PHYLIP’s output • Use AlignIO.parse("resampled.phy", "phylip") • The result is an iterator object that contains all MSAs

  46. Write alignment to file fromBio.Alphabetimportgeneric_dnafromBio.SeqimportSeqfromBio.SeqRecordimportSeqRecordfromBio.AlignimportMultipleSeqAlignmentalign1 =MultipleSeqAlignment([SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),])from Bio importAlignIOAlignIO.write(align1, "my_example.phy", "phylip") 3 12Alpha     ACTGCTAGCT AGBeta     ACT-CTAGCT AGGamma     ACTGCTAGDT AG3 9Delta     GTCAGC-AGEpislon    GACAGCTAGZeta     GTCAGCTAG3 13Eta         ACTAGTACAG CTGTheta     ACTAGTACAG CT-Iota     - CTACTACAG GTG

  47. Slicing Alignments work like numpy matrices >>>print alignment[2,6]T# You can pull out a single column as a string like this:>>>print alignment[:,6]TTT---T >>>print alignment[3:6,:6]SingleLetterAlphabet() alignment with 3 rows and 6 columnsAEGDDP COATB_BPM13/24-72AEGDDP COATB_BPZJ2/1-49AEGDDP Q9T0Q9_BPFD/1-49>>>print alignment[:,:6]SingleLetterAlphabet() alignment with 7 rows and 6 columnsAEPNAA COATB_BPIKE/30-81AEPNAA Q9T0Q8_BPIKE/1-52DGTSTA COATB_BPI22/32-83AEGDDP COATB_BPM13/24-72AEGDDP COATB_BPZJ2/1-49AEGDDP Q9T0Q9_BPFD/1-49FAADDA COATB_BPIF1/22-73

  48. External applications • How do we call MSA algorithms on unaligned set of sequences? • Biopython provides wrappers • The idea: • Create a command line object with the algorithm options • Invoke the command (Python uses subprocesses) • Bio.Align.Applications module: • >>> import Bio.Align.Applications • >>> dir(Bio.Align.Applications) • ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ]

More Related