Pattern matching
We need two things:
• A syntax to encode sets of (sub)strings
• A set of tools to search for (sub)strings in strings
The Python syntax to express a set of strings: regular expressions

The set 'ABC', 'abc', 'Abc', 'aBc', 'abC', 'ABc', 'aBC', 'AbC' is encoded by [Aa][Bb][Cc]

Characters and metacharacters
A regular expression is a string that encodes a set of strings through the use of characters and metacharacters.
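As a quick check of the character-class idea, the sketch below generates the eight case variants of 'abc' and tests each one against [Aa][Bb][Cc]. It assumes Python 3; the variable names and the use of itertools are illustrative choices, not part of the slides.

```python
import itertools
import re

# Each [..] class matches exactly one character taken from the listed set.
regexp = re.compile('[Aa][Bb][Cc]')

# Build all 2**3 = 8 upper/lower-case variants of 'abc'.
variants = [''.join(chars) for chars in itertools.product('Aa', 'Bb', 'Cc')]

# fullmatch() succeeds only if the whole string matches the pattern.
matched = [v for v in variants if regexp.fullmatch(v)]

print(variants)
print(matched == variants)       # True: every variant belongs to the set
print(regexp.fullmatch('abd'))   # None: 'd' is not in [Cc]
```

One three-class pattern therefore stands in for an explicit list of eight strings.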
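The Python regular expression syntax
Metacharacters: [ ] ^ $ \ . | * + ? { } ( )
The meaning of '\' depends on what follows it: before a metacharacter it makes that character literal (for example, '\.' matches a dot), and before certain ordinary characters it forms a special sequence (for example, '\d' matches a digit).
Repetitions: * + ? { }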
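The two roles of the backslash and the four repetition operators can be illustrated with a small sketch. It assumes Python 3; the test strings are made up for the example and do not come from the slides.

```python
import re

# '.' is a metacharacter (any character); '\.' is a literal dot.
any_char = re.findall(r'a.c', 'abc a.c')      # ['abc', 'a.c']
literal_dot = re.findall(r'a\.c', 'abc a.c')  # ['a.c']

# '\' followed by an ordinary character can form a special sequence:
# '\d' stands for any decimal digit.
digits = re.findall(r'\d', 'A1B22')           # ['1', '2', '2']

# Repetition operators apply to the item just before them.
text = 'AT AAT AAAT T'
star = re.findall(r'A*T', text)        # zero or more A
plus = re.findall(r'A+T', text)        # one or more A
optional = re.findall(r'A?T', text)    # zero or one A
ranged = re.findall(r'A{2,3}T', text)  # two to three A

print(star)      # ['AT', 'AAT', 'AAAT', 'T']
print(plus)      # ['AT', 'AAT', 'AAAT']
print(optional)  # ['AT', 'AT', 'AT', 'T']
print(ranged)    # ['AAT', 'AAAT']
```

Raw strings (the r'...' prefix) are used so that Python itself does not interpret the backslashes before the re module sees them.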
A phosphorylation site motif
R.[ST][^P]

Protein names
[SP]{0,1}[fhm]T{0,1}G{0,1}R
matches 'SmTGR', 'PfTR', 'hTR', 'hGR'
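Both patterns can be tried directly. The sketch below assumes Python 3; the short peptides 'AARKSAA' and 'AARKSPA' and the two rejected names 'TGR' and 'SmTGGR' are hypothetical test inputs added for illustration.

```python
import re

# Phosphorylation site motif: R, any residue, S or T, then anything but P.
site = re.compile('R.[ST][^P]')
hit = site.search('AARKSAA')
miss = site.search('AARKSPA')   # P in the last position is excluded
print(hit.group())              # RKSA
print(miss)                     # None

# Protein names: optional S or P, one of f/h/m, optional T, optional G, then R.
names = re.compile('[SP]{0,1}[fhm]T{0,1}G{0,1}R')
candidates = ['SmTGR', 'PfTR', 'hTR', 'hGR', 'TGR', 'SmTGGR']

# fullmatch() requires the whole name to fit the pattern.
accepted = [c for c in candidates if names.fullmatch(c)]
print(accepted)                 # ['SmTGR', 'PfTR', 'hTR', 'hGR']
```

Note that {0,1} is equivalent to the shorter ? operator.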
We first have to compile a regular expression into a Python pattern object:
>>> import re
>>> motif = 'A[AC]T'
>>> regexp = re.compile(motif)
>>> regexp
<_sre.SRE_Pattern object at 0x22de0>
>>> motif
'A[AC]T'
NB: the pattern string can also be passed to compile() directly:
>>> regexp = re.compile('A[AC]T')
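A compiled pattern object keeps its source string and can be reused for many searches. The sketch below assumes Python 3 and uses a made-up test sequence, 'GGACTGG'. It also shows that the module-level function re.search() gives the same result without an explicit compile step.

```python
import re

motif = 'A[AC]T'
regexp = re.compile(motif)

# The source string is stored on the pattern object.
print(regexp.pattern)                      # A[AC]T

# Searching through the compiled object...
via_object = regexp.search('GGACTGG')
# ...or through the module-level function, which compiles internally.
via_module = re.search(motif, 'GGACTGG')

print(via_object.group(), via_object.span())   # ACT (2, 5)
print(via_module.group() == via_object.group())
```

Compiling once is the more convenient form when the same pattern is applied to many sequences.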
Pattern matching
>>> S = 'R.[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'SASRQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRP'
>>> # Here we use search():
>>> m1 = regexp.search(seq)
>>> # This returns a Match object:
>>> m1
<_sre.SRE_Match object at 0x706e8>
>>> # Match object group() method:
>>> m1.group()
'RQSA'
Pattern matching
>>> S = 'R.[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> # Here we use match():
>>> m2 = regexp.match(seq)
>>> # This also returns a Match object:
>>> m2
<_sre.SRE_Match object at 0x70020>
>>> m2.group()
'RQSA'
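The difference between the two methods shows when the motif is not at the very beginning of the sequence: search() scans the whole string, while match() only tries position 0. A minimal sketch, assuming Python 3 and reusing the sequence that starts with 'SAS':

```python
import re

regexp = re.compile('R.[ST][^P]')
seq = 'SASRQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRP'

found = regexp.search(seq)     # scans the whole string
anchored = regexp.match(seq)   # tries the start of the string only

print(found.group(), found.span())   # RQSA (3, 7)
print(anchored)                      # None

# Both methods return None on failure, so test before calling group().
if anchored is None:
    print('no match at the start of the sequence')
```

Calling group() on None raises an AttributeError, which is why the result is checked first.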
• group() returns the matching substring
• span() returns a tuple containing the (start, end) positions of the match
• start() returns the start position of the match
• end() returns the end position of the match
>>> S = 'R.[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> m1 = regexp.search(seq)
>>> m1.group()
'RQSA'
>>> m1.span()
(0, 4)
>>> m1.start()
0
>>> m1.end()
4
What if we want to find ALL matches of a regular expression, not only the first one? Use findall() and finditer().
>>> S = 'R.[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> matches = regexp.findall(seq)
>>> matches
['RQSA', 'RRSL', 'RPSK']
An iterator is a "container" of objects that can be traversed using a for loop. In this specific case, the iterator returned by finditer() yields Python Match objects. Each Match object can be individually accessed using Match object methods, such as group(), span(), start() and end().
>>> S = 'R.[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> matches = regexp.finditer(seq)
>>> matches
<callable-iterator object at 0x786d0>
>>> for s in matches:
...     print(s.group())
...     print(s.span())
...     print(s.start())
...     print(s.end())
...
RQSA
(0, 4)
0
4
RRSL
(18, 22)
18
22
RPSK
(40, 44)
40
44
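The same traversal can be written more compactly by collecting the position and text of each match into a list. This is a sketch assuming Python 3; the name hits and the output format are illustrative.

```python
import re

regexp = re.compile('R.[ST][^P]')
seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'

# One (start, end, substring) tuple per Match object yielded by finditer().
hits = [(m.start(), m.end(), m.group()) for m in regexp.finditer(seq)]

for start, end, site in hits:
    print(f'{site} at {start}-{end}')

# findall() returns only the matched substrings, without positions.
print(regexp.findall(seq) == [site for _, _, site in hits])   # True
```

finditer() is the better choice whenever the positions of the matches matter, since findall() discards them.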
Grouping
How to divide a regular expression into subgroups that match different components of interest
>>> S = 'R(.)[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> m1 = regexp.search(seq)
>>> m1.group()
'RQSA'
>>> m1.group(1)
'Q'
>>> S = 'R(.{0,3})[ST][^P]'
>>> regexp = re.compile(S)
>>> seq = 'QSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'
>>> m1 = regexp.search(seq)
>>> m1.group()
'RRRSL'
>>> m1.group(1)
'RR'
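A pattern may contain several groups, numbered from 1 by the position of their opening parenthesis. The sketch below assumes Python 3; the three-group variant of the motif is an illustrative extension, not a pattern from the slides. It also shows one caveat: when a pattern contains a group, findall() returns the group contents instead of the whole match.

```python
import re

seq = 'RQSAMGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQRPSKP'

# Three groups: the free residue, the S/T residue, and the non-P residue.
regexp = re.compile('R(.)([ST])([^P])')
m = regexp.search(seq)

print(m.group())     # RQSA  (group 0 is the whole match)
print(m.groups())    # ('Q', 'S', 'A')
print(m.group(2))    # S
print(m.span(2))     # (2, 3): position of group 2 within seq

# With exactly one group, findall() returns only that group's text.
one_group = re.compile('R(.)[ST][^P]')
print(one_group.findall(seq))   # ['Q', 'R', 'P']
```

Use finditer() and group() when both the whole match and its subgroups are needed.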
Modifying strings
The re module also provides methods to search and replace substrings: sub(repl, string[, count]) and subn(repl, string[, count]). sub() returns the modified string; subn() returns a tuple of the modified string and the number of replacements made. The optional count limits how many matches are replaced.
>>> r = r'\|'
>>> separator = re.compile(r)
>>> s = 'ATOM:CA|RES:ALA|CHAIN:B|NUMRES:166'
>>> new_s = separator.sub('@', s)
>>> new_s
'ATOM:CA@RES:ALA@CHAIN:B@NUMRES:166'
>>> new_s = separator.sub('@', s, 2)
>>> new_s
'ATOM:CA@RES:ALA@CHAIN:B|NUMRES:166'
>>> new_s = separator.subn('@', s)
>>> new_s
('ATOM:CA@RES:ALA@CHAIN:B@NUMRES:166', 3)
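The separator has to be escaped because '|' is a metacharacter (alternation). A minimal sketch, assuming Python 3: re.escape() builds the escaped pattern, the tuple returned by subn() is unpacked, and the same compiled separator is reused with split() to turn the record into a dictionary. The names first_two and fields are illustrative.

```python
import re

s = 'ATOM:CA|RES:ALA|CHAIN:B|NUMRES:166'

# re.escape() adds the backslash needed to treat '|' literally.
separator = re.compile(re.escape('|'))

# subn() returns (new_string, number_of_replacements).
new_s, n_replaced = separator.subn('@', s)
print(new_s, n_replaced)

# count limits the number of replacements, counted from the left.
first_two = separator.sub('@', s, count=2)
print(first_two)

# The same pattern can split the record into KEY:VALUE fields.
fields = dict(field.split(':') for field in separator.split(s))
print(fields['RES'])    # ALA
```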
Summary
• Search a functional site in a protein sequence
• Search a TFBS in a genomic sequence
• Fetch an abstract from PubMed (urllib2)
• Search a word in a text (text mining)