Programming for Engineers in Python

Programming for Engineers in Python Biopython

Classes class<classname>: statement_1 . . statement_n • The methods of a class get the instance as the first parameter • traditionally named self • The method __init__ is called upon object construction (if available)

Classes • Reminder: type = data representation + behavior. • Classes are user-defined types. class<classname>: statement_1 . . statement_n • Objects of a class are called class instances. • Like a mini-program: • Variables. • Function Definitions. • Even arbitrary commands.

Classes – Attributes and Methods classVector2D: def__init__ (self, x, y):self.x, self.y= x, y defsize (self):return (self.x** 2 +self.y** 2) **0.5 Instance Attributes(each instance has its own copy) Methods

Classes – Instantiate and Use >>>v = Vector2D(3, 4) # Make instance.>>> v<__main__.Vector2D object at 0x00000000031A2828>>>>v.size() # Call method on instance.5.0

Example – Multimap • A dictionary with more than one value for each key • We already needed it once or twice and used: >>>lst=d.get(key, [ ])>>>lst.append(value)>>> d[key] =lst • We will now write a new class that will be a wrapperaround a dict • The class will have methods that allow us to keep multiple values for each key

Multimap. partial code classMultimap:def__init__(self):'''Create an empty Multimap'''self.inner= innerdefget(self, key):'''Return list of values associated with key'''returnself.inner.get(key, [])defput(self, key, value):'''Adds value to the list of values associated with key'''value_list=self.get(key)if value notinvalue_list:value_list.append(value)self.inner[key] =value_list

Multimapput_all and remove defput_all(self, key, values):for v in values: self.put(key, v) defremove(self, key, value):value_list=self.get(key)if value invalue_list:value_list.remove(value)self.inner[key] =value_listreturnTruereturnFalse

Multimap. Partial code def__len__(self):'''Returns the number of keys in the map'''returnlen(self.inner)def__str__(self):'''Converts the map to a string'''returnstr(self.inner)def__cmp__(self, other):'''Compares the map with another map'''returnself.inner.cmp(other)def__contains__(self, key):'''Returns True if key exists in the map'''returnself.has_key(k)

Multimap • Use case – a dictionary of countries and their cities: >>> m =Multimap() >>>m.put('Israel', 'Tel-Aviv') >>>m.put('Israel', 'Jerusalem') >>>m.put('France', 'Paris') >>>m.put_all('England',('London', 'Manchester', 'Moscow')) >>>m.remove('England', 'Moscow') >>> print m.get('Israel') ['Tel-Aviv', 'Jerusalem']

Biopython • An international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology • Provides tools for • Parsing files (fasta, clustalw, GenBank,…) • Interface to common softwares • Operations on sequences • Simple machine learning applications • BLAST • And many more

Installing Biopython • Go to http://biopython.org/wiki/Download • Windows • Unix • Select python 2.7 • NumPy is required

SeqIO • The standard Sequence Input/Output interface for BioPython • Provides a simple uniform interface to input and output assorted sequence file formats • Deals with sequences as SeqRecord objects • There is a sister interface Bio.AlignIO for working directly with sequence alignment files as Alignment objects

Parsing a FASTA file # Parse a simple fasta filefrom Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.fasta", "fasta"):print seq_record.idprintrepr(seq_record.seq)printlen(seq_record) Why repr and not str?

GenBank files # genbank filesfrom Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.gbk", "genbank"):printseq_record# added to print just one record examplebreak

GenBank files from Bio importSeqIOforseq_recordinSeqIO.parse("ls_orchid.gbk", "genbank"):print seq_record.idprintrepr(seq_record.seq)printlen(seq_record)

Sequence objects • Support similar methods as standard strings • Provide additional methods • Translate • Reverse complement • Support different alphabets • AGTAGTTAAA can be • DNA • Protein

Sequences and alphabets • Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions • For example: • Adding ambiguous symbols • Adding special new characters

Example – generic alphabet Non-specific alphabet

Example – specific sequences >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("AGTACACTGGT", IUPAC.unambiguous_dna)>>>my_seqSeq('AGTACACTGGT', IUPACUnambiguousDNA())>>>my_seq.alphabetIUPACUnambiguousDNA()>>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_prot=Seq("AGTACACTGGT", IUPAC.protein)>>>my_protSeq('AGTACACTGGT', IUPACProtein())>>>my_prot.alphabetIUPACProtein()

Sequences act like strings • Access elements • Count without overlaps >>>printmy_seq[0] #first letterG>>>printmy_seq[2] #third letterT>>>printmy_seq[-1] #last letterG >>>fromBio.SeqimportSeq>>>"AAAA".count("AA")2>>>Seq("AAAA").count("AA")2

Calculate GC content >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>fromBio.SeqUtilsimport GC>>>my_seq=Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)>>> GC(my_seq)46.875

Slicing • Simple slicing • Start, stop, stride >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)>>>my_seq[4:12]Seq('GATGGGCC', IUPACUnambiguousDNA()) >>>my_seq[0::3]Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())>>>my_seq[1::3]Seq('AGGCATGCATC', IUPACUnambiguousDNA())>>>my_seq[2::3]Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Concatenation • Simple addition as in Python • But, alphabets must fit >>>fromBio.Alphabetimport IUPAC>>>fromBio.SeqimportSeq>>>protein_seq=Seq("EVRNAK", IUPAC.protein)>>>dna_seq=Seq("ACGT", IUPAC.unambiguous_dna)>>>protein_seq+dna_seqTraceback (most recent call last): …

Changing case >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimportgeneric_dna>>>dna_seq=Seq("acgtACGT", generic_dna)>>>dna_seqSeq('acgtACGT', DNAAlphabet())>>>dna_seq.upper()Seq('ACGTACGT', DNAAlphabet())>>>dna_seq.lower()Seq('acgtacgt', DNAAlphabet())

Changing case • Case is important for matching • IUPAC names are upper case >>>"GTAC"indna_seqFalse>>>"GTAC"indna_seq.upper()True >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>dna_seq=Seq("ACGT", IUPAC.unambiguous_dna)>>>dna_seqSeq('ACGT', IUPACUnambiguousDNA())>>>dna_seq.lower()Seq('acgt', DNAAlphabet())

Reverse complement >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>my_seq=Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)>>>my_seq.complement()Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())>>>my_seq.reverse_complement()Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

Transcription >>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)>>>template_dna=coding_dna.reverse_complement()>>>messenger_rna=coding_dna.transcribe()>>>messenger_rnaSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) As you can see, all this does is switch T → U, and adjust the alphabet.

Translation Simple example >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>messenger_rna=Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)>>>messenger_rnaSeq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())>>>messenger_rna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) Stop codon!

Translation from the DNA >>>fromBio.SeqimportSeq>>>fromBio.Alphabetimport IUPAC>>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)>>>coding_dnaSeq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())>>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

Using different translation tables • In several cases we may want to use different translation tables • Translation tables are given IDs in GenBank (standard=1) • Vertebrate Mitochondrial is table 2 • More details in http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

Using different translation tables • >>>coding_dna=Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table="Vertebrate Mitochondrial")Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table=2)Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Translation tables in biopython

Translate up to the first stop in frame >>>coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(to_stop=True)Seq('MAIVMGR', IUPACProtein())>>>coding_dna.translate(table=2)Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))>>>coding_dna.translate(table=2, to_stop=True)Seq('MAIVMGRWKGAR', IUPACProtein())

Comparing sequences Standard “==“ comparison is done by comparing the references (!), hence: >>> seq1 =Seq("ACGT", IUPAC.unambiguous_dna)>>> seq2 =Seq("ACGT", IUPAC.unambiguous_dna)>>> seq1==seq2 • Warning (from warnings module): • … FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception)… please use str(seq1)==str(seq2)to make your code explicit and to avoid this warning.False>>> seq1==seq1True

Mutable vs. Immutable • Like strings standard seq objects are immutable • If you want to create a mutable object you need to write it by either: • Use the “tomutable()” method • Use the mutable constructor mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Unknown sequences example • In many biological cases we deal with unknown sequences >>>fromBio.SeqimportUnknownSeq>>>fromBio.Alphabetimport IUPAC>>>unk_dna=UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)>>>my_seq=Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)>>>unk_dna+my_seqSeq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

MSA

Read MSA • Use Bio.AlignIO.read(file, format) • File – the file path • Format support: • “stockholm” • “fasta” • “clustal” • … • Use help(AlignIO) for details

Example • We want to parse this file from PFAM

Example from Bio importAlignIOalignment =AlignIO.read("PF05371.sth", "stockholm")print alignment

Alignment object example >>>from Bio importAlignIO>>> alignment =AlignIO.read("PF05371_seed.sth", "stockholm")>>>print alignment[1]ID: Q9T0Q8_BPIKE/1-52Name: Q9T0Q8_BPIKEDescription: Q9T0Q8_BPIKE/1-52Number of features: 0/start=1/end=52/accession=Q9T0Q8.1Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())

Alignment object example >>>print"Alignment length %i"%alignment.get_alignment_length()Alignment length 52>>>for record in alignment:print"%s - %s"% (record.seq, record.id)AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

Cross-references example Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?>>>for record in alignment:ifrecord.dbxrefs:print record.id, record.dbxrefsCOATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

Comments • Remember that almost all MSA formats are supported • When you have more than one MSA in your files use AlignIO.parse() • Common example is PHYLIP’s output • Use AlignIO.parse("resampled.phy", "phylip") • The result is an iterator object that contains all MSAs

Write alignment to file fromBio.Alphabetimportgeneric_dnafromBio.SeqimportSeqfromBio.SeqRecordimportSeqRecordfromBio.AlignimportMultipleSeqAlignmentalign1 =MultipleSeqAlignment([SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),])from Bio importAlignIOAlignIO.write(align1, "my_example.phy", "phylip") 3 12Alpha ACTGCTAGCT AGBeta ACT-CTAGCT AGGamma ACTGCTAGDT AG3 9Delta GTCAGC-AGEpislon GACAGCTAGZeta GTCAGCTAG3 13Eta ACTAGTACAG CTGTheta ACTAGTACAG CT-Iota - CTACTACAG GTG

Slicing Alignments work like numpy matrices >>>print alignment[2,6]T# You can pull out a single column as a string like this:>>>print alignment[:,6]TTT---T >>>print alignment[3:6,:6]SingleLetterAlphabet() alignment with 3 rows and 6 columnsAEGDDP COATB_BPM13/24-72AEGDDP COATB_BPZJ2/1-49AEGDDP Q9T0Q9_BPFD/1-49>>>print alignment[:,:6]SingleLetterAlphabet() alignment with 7 rows and 6 columnsAEPNAA COATB_BPIKE/30-81AEPNAA Q9T0Q8_BPIKE/1-52DGTSTA COATB_BPI22/32-83AEGDDP COATB_BPM13/24-72AEGDDP COATB_BPZJ2/1-49AEGDDP Q9T0Q9_BPFD/1-49FAADDA COATB_BPIF1/22-73

External applications • How do we call MSA algorithms on unaligned set of sequences? • Biopython provides wrappers • The idea: • Create a command line object with the algorithm options • Invoke the command (Python uses subprocesses) • Bio.Align.Applications module: • >>> import Bio.Align.Applications • >>> dir(Bio.Align.Applications) • ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ]

Programming for Engineers in Python