More PHP functions for using regular expressions • Up to now, we have seen just one of the library of functions which PHP provides for using regular expressions • The full library is described in Chapter CVIII of the PHP Manual • We will consider just two more of them • preg_match and • preg_match_all
preg_match • The format of a call to this function is int preg_match ( string pattern, string subject [, array &matches [, int flags [, int offset]]] ) • As can be seen, only the first two arguments are required, so a minimum-argument call is of the form int preg_match ( string regexp, string subject) which returns 0 if the regexp is not matched inside the subject string, or 1 if the regexp is matched inside the string
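The return values described above can be checked with a short sketch. The subject string and the patterns are hypothetical, chosen only for illustration. The last call shows a third outcome worth knowing: when the pattern itself is faulty, preg_match returns false rather than 0, so a strict comparison is needed to tell "no match" from "error".

```php
<?php
// Hypothetical sample subject, chosen only for illustration.
$subject = "The quick brown fox";

$found    = preg_match("/quick/", $subject);  // pattern occurs in the subject
$notFound = preg_match("/slow/", $subject);   // pattern does not occur

var_dump($found);     // int(1)
var_dump($notFound);  // int(0)

// A malformed pattern (unclosed parenthesis) makes preg_match return false.
// The @ operator only suppresses the warning that PHP would otherwise print.
$broken = @preg_match("/(unclosed/", $subject);
var_dump($broken === false);  // bool(true)
```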
PHP code <?php $document = "<h1>France</h1>\n<p>Foods of France:\n<ol><li>wine</li><li>bread</li></ol></p>"; $regexp = "%<p>.+</p>%"; if ( preg_match($regexp,$document) ) { echo "Yes"; } else { echo "No"; } ?> • Output No • Why no match? • Answer: on next slide
We need to make the dot match newlines, by adding the s modifier after the closing delimiter • Revised PHP code <?php $document = "<h1>France</h1>\n<p>Foods of France:\n<ol><li>wine</li><li>bread</li></ol></p>"; $regexp = "%<p>.+</p>%s"; if ( preg_match($regexp,$document) ) { echo "Yes"; } else { echo "No"; } ?> • Output Yes
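The effect of the s modifier can be seen in isolation with a shorter subject string (a hypothetical one, not the France document). The third pattern is an alternative that some people prefer: a character class that matches any character, including a newline, without needing the modifier.

```php
<?php
// Hypothetical two-line paragraph; \n inside double quotes is a newline.
$document = "<p>line one\nline two</p>";

// Without the s modifier the dot stops at the newline, so no match.
$withoutS = preg_match('%<p>.+</p>%', $document);

// With the s modifier the dot also matches the newline.
$withS = preg_match('%<p>.+</p>%s', $document);

// Alternative without the modifier: [\s\S] matches any character at all.
$withClass = preg_match('%<p>[\s\S]+</p>%', $document);

echo "without s: $withoutS, with s: $withS, with class: $withClass\n";
```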
preg_match (contd.) • Frequently, it is useful to use the third, optional, argument int preg_match ( string pattern, string subject, array &matches ) • As before, this returns 0 or 1 depending on whether a match was found in the subject string • However, in addition, elements of the array in the third argument are set to parts of the matching substring of the subject string • matches[0] is set to contain the whole matching substring • matches[1] is set to contain the substring matched by the first () group • matches[2] is set to contain the substring matched by the second () group • … etc
PHP code <?php $document = "<h1>France</h1>\n<p>Foods of France:\n<ol><li>wine</li><li>bread</li></ol></p>"; echo "document is ".str_replace("<","&lt;",$document)." <br>"; $regexp = "%<p>(.+)</p>%s"; if ( preg_match($regexp,$document,$matches) ) { echo "Yes <br>"; echo "matches[0] is ".str_replace("<","&lt;",$matches[0])."<br>"; echo "matches[1] is ".str_replace("<","&lt;",$matches[1])."<br>"; } else { echo "No"; } ?> • Output document is <h1>France</h1> <p>Foods of France: <ol><li>wine</li><li>bread</li></ol></p> Yes matches[0] is <p>Foods of France: <ol><li>wine</li><li>bread</li></ol></p> matches[1] is Foods of France: <ol><li>wine</li><li>bread</li></ol>
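The example above has only one parenthesised group. A sketch with two groups may make the numbering clearer. It uses an anchor element of the kind that appears later in these slides; the pattern and the variable names $url and $text are illustrative assumptions.

```php
<?php
// An anchor element of the form used on the Guardian page discussed later.
$anchor = '<a HREF="/israel/Story/0,2763,1599840,00.html">'
        . 'Israel still in control of Gaza, says envoy</a>';

// Group 1 captures the HREF value, group 2 captures the link text.
$regexp = '%<a HREF="([^"]+)">([^<]+)</a>%';

$url  = null;
$text = null;
if (preg_match($regexp, $anchor, $matches)) {
    $url  = $matches[1];  // first () group
    $text = $matches[2];  // second () group
    echo "URL:  $url\n";
    echo "Text: $text\n";
} else {
    echo "No match\n";
}
```

Here $matches has three elements: the whole matching substring, then one element per group.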
preg_match_all • So why would we need a function called preg_match_all? • See next slide
PHP code <?php $document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p>"; echo "document is ".str_replace("<","&lt;",$document)." <br>"; $regexp = "%<p>(.+?)</p>%s"; if ( preg_match($regexp,$document,$matches) ) { echo "Yes <br>"; echo "matches[0] is ".str_replace("<","&lt;",$matches[0])."<br>"; echo "matches[1] is ".str_replace("<","&lt;",$matches[1])."<br>"; } else { echo "No"; } ?> • Output document is <p>This is paragraph 1.</p> <p>And this is paragraph 2.</p> Yes matches[0] is <p>This is paragraph 1.</p> matches[1] is This is paragraph 1. • That is, preg_match only finds the first match
preg_match_all • preg_match_all is like preg_match except • that it finds all matches and • thus, the value returned in $matches is actually an array of arrays int preg_match_all ( string pattern, string subject, array &matches ) • $matches[0] is an array of all the substrings which match the overall regular expression • $matches[1] is an array of all the substrings which match the first parenthesised sub-expression • $matches[2] is an array of all the substrings which match the second parenthesised sub-expression • and so on
PHP code <?php $document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p><p>Paragraph 3.</p>"; echo str_replace("<","&lt;",$document)." <br>"; $regexp = "%<p>(.+?)</p>%s"; if ( preg_match_all($regexp,$document,$matches) ) {$numMatches = count($matches[0]); for ($i=0;$i < $numMatches; $i++) {echo "matches[0][$i] is ".str_replace("<","&lt;",$matches[0][$i])."<br>"; } for ($i=0;$i < $numMatches; $i++) {echo "matches[1][$i] is ".str_replace("<","&lt;",$matches[1][$i])."<br>"; } } else { echo "No"; } ?> • Output <p>This is paragraph 1.</p> <p>And this is paragraph 2.</p> <p>Paragraph 3.</p> matches[0][0] is <p>This is paragraph 1.</p> matches[0][1] is <p>And this is paragraph 2.</p> matches[0][2] is <p>Paragraph 3.</p> matches[1][0] is This is paragraph 1. matches[1][1] is And this is paragraph 2. matches[1][2] is Paragraph 3.
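The layout shown above, grouped by sub-expression, is the default. As a variation that these slides do not cover, preg_match_all also accepts a fourth argument; passing the constant PREG_SET_ORDER groups the results by match instead, which is often more convenient in a loop. A minimal sketch using the same document:

```php
<?php
$document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p><p>Paragraph 3.</p>";
$regexp   = "%<p>(.+?)</p>%s";

// With PREG_SET_ORDER, $matches[$i] describes the i-th match:
// $matches[$i][0] is the whole match, $matches[$i][1] the first () group.
$count = preg_match_all($regexp, $document, $matches, PREG_SET_ORDER);

$inner = array();
foreach ($matches as $i => $match) {
    $inner[] = $match[1];
    echo "match $i: ", htmlspecialchars($match[0]), " contains: ", $match[1], "\n";
}
echo "number of matches: $count\n";
```

Note that preg_match_all returns the number of matches found, not just 0 or 1.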
PHP Filesystem Functions • Chapter XXXVIII of the PHP manual • We will consider just four • resource fopen ( string filename, string mode [, bool use_include_path [, resource zcontext]] ) Usually used as resource fopen ( string filename, string mode) • bool fclose ( resource handle ) • string fread ( resource handle, int length ) • int fwrite ( resource handle, string someString [, int length] ) Usually used as int fwrite (resource handle, string someString )
fopen • Typical call format: resource fopen ( string filename, string mode) • Example calls $fileHandle1 = fopen("names.txt","w"); opens, for writing, a file called "names.txt" in the same directory as the PHP program $fileHandle1 = fopen("names.txt","r"); opens, for reading, a file called "names.txt" in the same directory as the PHP program $fileHandle1 = fopen("/usr/csr/names.txt","w"); opens, for writing, a file called "/usr/csr/names.txt" on the same computer as the PHP program $fileHandle1 = fopen("/usr/csr/names.txt","r"); opens, for reading, a file called "/usr/csr/names.txt" on the same computer as the PHP program $fileHandle1 = fopen("http://www.rte.ie/index.html","r"); opens, for reading, a file on an external web-site
fread and file_get_contents • Typical call format: string fread ( resource handle, int length ) • Example calls $contents = fread($fileHandle1,1000); reads the next 1000 bytes from the file with handle $fileHandle1, or up to the end of the file if there are fewer than 1000 bytes still unread in the file $contents = fread($fileHandle1,100000000); reads the next 100,000,000 bytes (100 MB) from the file with handle $fileHandle1, or up to the end of the file if there are fewer than 100 MB still unread in the file; since very few files are as large as 100 MB, this probably just makes the computer read up to the end of the file • Note that when the handle refers to a network stream, such as a URL, a single fread call may return fewer bytes than requested, as soon as some data is available • Typical call format: string file_get_contents ( string filename [, bool use_include_path [, resource context [, int offset [, int maxlen]]]] ) • Example call $contents = file_get_contents($someURL); reads the entire file or web page into a string
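Because a single fread call is not guaranteed to deliver everything (on network streams in particular it can return as soon as some data has arrived), a common idiom is to read in chunks until feof reports the end of the stream. The sketch below is an illustration only: it reads from a temporary local file created for the demonstration, and the chunk size of 8192 bytes is an arbitrary choice.

```php
<?php
// Create a temporary 20000-byte file so that the example is self-contained.
$path = tempnam(sys_get_temp_dir(), "demo");
file_put_contents($path, str_repeat("x", 20000));

$handle   = fopen($path, "r");
$contents = "";

// Keep reading chunks until the end of the file is reached.
while (!feof($handle)) {
    $chunk = fread($handle, 8192);
    if ($chunk === false) {
        break;  // read error: stop rather than loop forever
    }
    $contents .= $chunk;
}
fclose($handle);
unlink($path);  // remove the temporary file

echo strlen($contents), " bytes read\n";
```

The same loop works unchanged when the handle comes from fopen on a URL.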
fwrite • Typical call format: int fwrite ( resource handle, string someString ) • Example call $result = fwrite($fileHandle1,"<h1>Blah blah</h1>"); writes the string <h1>Blah blah</h1> into the file with handle $fileHandle1 and returns the number of bytes written into the file, or returns FALSE if there was an error
fclose • Typical call format: bool fclose ( resource handle ) • Example call fclose($fileHandle1); closes the file with handle $fileHandle1
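The three calls fopen, fwrite and fclose are normally used together. The following sketch shows one complete cycle with the return values checked. It writes to a temporary file (an assumption made so that the example can run anywhere) and then reads the file back to confirm what was written.

```php
<?php
// Temporary file used only for this demonstration.
$path = tempnam(sys_get_temp_dir(), "demo");
$text = "<h1>Blah blah</h1>";

$written  = false;
$readBack = null;

$handle = fopen($path, "w");
if ($handle === false) {
    echo "Could not open $path for writing\n";
} else {
    // fwrite returns the number of bytes written, or false on error.
    $written = fwrite($handle, $text);
    fclose($handle);

    if ($written === strlen($text)) {
        echo "Wrote $written bytes\n";
    } else {
        echo "Write failed or was incomplete\n";
    }
    $readBack = file_get_contents($path);
}
unlink($path);  // remove the temporary file
```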
Example usage • PHP code: <?php $rte = fopen("http://www.rte.ie/","r"); $contents = fread($rte,100000000); fclose($rte); echo str_replace("<","&lt;",$contents); ?> • Output: <html> <head> <META http-equiv="Content-Type" content="text/html; charset=iso-8859-15"> <title>RTE.ie - Irish Public Service TV and radio stations online</title> <META name="Description" content="RTE.ie - Irish public service television and radio broadcaster on the World Wide Web - bringing you Irish news, sports, business, entertainment, weather, television and radio, programmes, current affairs, health, motors, travel, video and audio."> <META name="Keywords" content="rte, rte.ie, irish, television, radio, ireland, Irish, news, business, sport, results, news, Ireland, video, audio, broadcaster, irish"> <STYLE TYPE="text/css"><!-- A {text-decoration: none; color: #000000} A:hover {text-decoration: none; color: #660000} --></STYLE> <STYLE TYPE="text/css"> <!-- FORM {display:inline;} --></STYLE> <script language="JavaScript"> <!-- function AertelPage() { var
Compare output of program with page seen in browser <html> <head> <META http-equiv="Content-Type" content="text/html; charset=iso-8859-15"> <title>RTE.ie - Irish Public Service TV and radio stations online</title> <META name="Description" content="RTE.ie - Irish public service television and radio broadcaster on the World Wide Web - bringing you Irish news, sports, business, entertainment
Example application • Extracting output from website of The Guardian: • The Guardian maintains a page, updated almost daily, of recent stories on Israel and the Middle East at http://www.guardian.co.uk/israel • Its appearance on 25 October 2005 is on the next slide • The first part of the source code for the page, obtained from a browser, is on the slide after that • The complete source code for the 25 October 2005 version of the page is in the file http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianIsrael25October2005.txt (Note that this is an exact copy of the page from the Guardian site, so the src and href attribute values in the file assume that the page is stored at the Guardian URL above.) • We want to extract the text stories, as shown in the third-next slide
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <!-- artifact_id=377264, built 2005-10-25 11:00 --> <html lang="en"> <head> <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"> <meta name="artifact" content="377264"> <title>Guardian Unlimited | Special reports | Special report: Israel & the Middle East</title> <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"> <link rel="stylesheet" href="/external/styles/basic/0,14491,,00.css" type="text/css"> <style type="text/css"> <!-- a.GUARDIANUNLIMITED { text-decoration:none;
We want to produce a page like that on the right-hand side below from that on the left-hand side
Here is the source code around the headline Latest on the page <div class="maintrail"><font face="Geneva, Arial, Helvetica, sans-serif" size="2"> <b>Latest</b><hr size="1"> <p><span class="mainlink"><a HREF="/israel/Story/0,2763,1599840,00.html">Israel still in control of Gaza, says envoy</a></span><br /><b>October 25: </b>The international Middle East envoy, James Wolfensohn, has accused Israel of behaving as if it has not withdrawn from the Gaza Strip, by blocking its borders and failing to fulfil commitments to allow the movement of Palestinians and goods. </p> <b>Qur'an test</b><hr size="1">
Here is the source code around the headline Audio reports <p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596052,00.html">Israel's closed zone</a></span><br /><b>October 20, letters: </b>You graphically highlight the continuing expansionism of the Israeli government (Report, October 18). </p> <b>Audio reports</b><hr size="1"> <p><span class="mainlink"><a HREF="http://stream.guardian.co.uk:7080/ramgen/sys-audio/Guardian/audio/2005/09/12/120905McGreal.ra">Palestinians rush to Gaza</a></span><br /><b>September 12:</b> Many descended on the former Jewish Gaza settlements intent on causing chaos, but others came simply to see the beach for the first time, reports <b>Chris McGreal</b> from Khan Yunis. (2min 33s)
This PHP program will extract all the source code between the two headlines <?php $f1 = fopen("http://www.guardian.co.uk/israel/","r"); $document = fread($f1,100000); fclose($f1); $regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s"; preg_match($regexp,$document,$matches); $stories = $matches[1]; echo $stories; ?>
But we also want to remove all the intermediate headlines and rulings between the stories Wolfensohn, has accused Israel of behaving as if it has not withdrawn from the Gaza Strip, by blocking its borders and failing to fulfil commitments to allow the movement of Palestinians and goods. </p> <b>Qur'an test</b><hr size="1"> <p><span class="mainlink"><a HREF="/israel/Story/0,2763,1599227,00.html">Qur'an competition tests participants' memories</a></span><br /><b>October 24: </b>With senior militant leaders looking on, Palestinian officials opened an international competition yesterday testing participants' knowledge of the Qur'an. </p> <b>Comment and analysis</b><hr size="1"> <p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596291,00.html">Christian leanings at the Jerusalem Post</a></span><br /><b>October 20, Chris McGreal:</b> The strange and uneasy embrace between the Jewish state and America's evangelical right is being tightened. <br /> <a HREF="/israel/comment/0,10551,1590082,00.html">12.10.05, Jonathan Freedland: One and three-quarter state solution</a><br /> <a HREF="/israel/Story/0,2763,1584308,00.html">04.10.05, Chris McGreal: House that became a war zone</a></p> <b>West Bank</b><hr size="1"> <p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596168,00.html">Israel accused of 'road apartheid' in West Bank</a></span><br /><b>October 20: </b>Army seals off main route to Palestinian vehicles <br><b>· </b>Opponents say plan is to carve out new borders. <br />
This program will extract stories and remove all intermediate headlines and rulings between stories <?php $f1 = fopen("http://www.guardian.co.uk/israel/","r"); $document = fread($f1,100000); fclose($f1); $regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s"; preg_match($regexp,$document,$matches); $stories = $matches[1]; $regexp = "%<b>.+</b><hr size=\"1\">%"; /* No s modifier here: the dot must not match newlines, so that a match cannot start at the bold date inside a story and run on into the next headline. This relies on each headline being on a line of its own. */ $stories = preg_replace($regexp,"",$stories); echo $stories; ?>
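The headline-removal step can be tried without fetching the live page. The sketch below runs on a small hypothetical excerpt modelled on the source code shown earlier; the story texts are placeholders. It also uses a slightly different pattern from the program above, as an assumption about what is more robust: [^<]+ instead of .+, so that a match can never extend past the headline text, wherever the line breaks happen to fall.

```php
<?php
// Hypothetical excerpt: two stories separated by one headline and ruling.
$stories = "<p><b>October 25: </b>First story text. </p>\n"
         . "<b>Qur'an test</b><hr size=\"1\">\n"
         . "<p><b>October 24: </b>Second story text. </p>\n";

// A headline is bold text followed immediately by a ruling.
// [^<]+ cannot cross a tag, so the bold dates inside stories are left alone.
// The trailing \s* also removes the line break after the ruling.
$regexp = '%<b>[^<]+</b><hr size="1">\s*%';

$cleaned = preg_replace($regexp, "", $stories);
echo $cleaned;
```

Only the headline line disappears; the bold dates survive because they are not followed by a ruling.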
Correcting the URLs • The URLs on the Guardian page assume that the page is being delivered from the Guardian server • Anchor elements are of this form: <a HREF="/israel/Story/0,2763,1599840,00.html">Israel still in control of Gaza, says envoy</a> • Thus, if someone clicks a hotlink on our output, the browser will think the target page is on our server • We must make the URLs point to the Guardian server by making them full (or "absolute") URLs
This program corrects the URLs • Notice that we need only string manipulation <?php $f1 = fopen("http://www.guardian.co.uk/israel/","r"); $document = fread($f1,100000); fclose($f1); $regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s"; preg_match($regexp,$document,$matches); $stories = $matches[1]; $regexp = "%<b>.+</b><hr size=\"1\">%"; $stories = preg_replace($regexp,"",$stories); $stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories); echo "<h1>Today's Guardian stories on Palestine</h1>"; echo $stories; ?> • You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor1.php
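The URL-correction step is worth checking on its own, since it must change the server-relative links without damaging links that are already absolute. A minimal sketch, using two anchor elements taken from the page source shown earlier (the link texts are shortened):

```php
<?php
// One server-relative link and one link that is already absolute.
$stories = '<a HREF="/israel/Story/0,2763,1599840,00.html">Story</a> '
         . '<a HREF="http://stream.guardian.co.uk:7080/ramgen/sys-audio/'
         . 'Guardian/audio/2005/09/12/120905McGreal.ra">Audio</a>';

// Only attribute values that start with a slash are rewritten.
$fixed = str_replace('HREF="/', 'HREF="http://www.guardian.co.uk/', $stories);

echo $fixed, "\n";
```

The absolute link is untouched because its value begins with http: rather than with a slash. Note that str_replace is case-sensitive, so this works only because the page writes HREF in capitals.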
Putting the stories into a local file <?php $f1 = fopen("http://www.guardian.co.uk/israel/","r"); $document = fread($f1,100000); fclose($f1); $regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s"; preg_match($regexp,$document,$matches); $stories = $matches[1]; $regexp = "%<b>.+?</b><hr size=\"1\">%"; $stories = preg_replace($regexp,"",$stories); $stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories); $f2 = fopen("todaysGuardian.html","w"); fwrite($f2,"<h1>Today's Guardian Palestine stories</h1>"); fwrite($f2,$stories); fclose($f2); ?> <a href="todaysGuardian.html">See result</a> • You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor2.php
There is a problem • The trouble is that our program does not have write-access in the directory which contains the program • We must get it to write to a different directory, where it will have write-access
Putting the stories into a file in a write-access directory <?php $f1 = fopen("http://www.guardian.co.uk/israel/","r"); $document = fread($f1,100000); fclose($f1); $regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s"; preg_match($regexp,$document,$matches); $stories = $matches[1]; $regexp = "%<b>.+?</b><hr size=\"1\">%"; $stories = preg_replace($regexp,"",$stories); $stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories); $f2 = fopen("writable/todaysGuardian.html","w"); fwrite($f2,"<h1>Today's Guardian Palestine stories</h1>"); fwrite($f2,$stories); fclose($f2); ?> <a href="writable/todaysGuardian.html">See result</a> • You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php
A note on allowing PHP programs to write files • PHP programs are run by your Apache server • On Unix/Linux machines, the Apache server typically runs as an ordinary user, here with the name "nobody" • Thus, your PHP programs can only write into a directory where user nobody has write permission
A note on allowing PHP programs to write files (contd.) • This could be a directory where you have given everybody write-access, as in drwxrwxrwx 2 jabowen staff 35 Oct 25 09:58 writable • But this is unsafe • It is better to create a group which contains only yourself and nobody and give write access to that group, as in drwxrwxr-x 2 jabowen jbApach 35 Oct 25 09:58 writable where jbApach is a group that contains jabowen and nobody
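A program can test for write permission before it tries to write, using the built-in functions is_dir and is_writable. The sketch below wraps the test in a helper called canWriteTo, which is a hypothetical name introduced here for illustration. The directory name "writable" follows the examples above and may not exist on your machine.

```php
<?php
// Hypothetical helper: true if $dir exists and the user running PHP
// (for a web program, the Apache user) may create files in it.
function canWriteTo($dir)
{
    return is_dir($dir) && is_writable($dir);
}

$dir = "writable";  // directory name taken from the examples above
if (canWriteTo($dir)) {
    echo "PHP can write files into $dir\n";
} else {
    echo "PHP cannot write into $dir; check its owner, group and permissions\n";
}
```

Run from the web server, this reports the permissions of the Apache user; run from the command line, it reports your own.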
Generating this page automatically every day • To generate this page automatically every day, we need to run this program automatically http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php • To do this, we can use some Linux/Unix features • a utility called wget • another one called nohup • a third one called crontab
wget • wget is for non-interactive download of files from the Web • It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies • wget is non-interactive, meaning that it can work in the background, while the user is not logged on • This allows you to start a retrieval and disconnect from the system, letting wget finish the work • By contrast, most web browsers require the user's constant presence
Using wget (contd.) • Output is saved into a local file with the same name as the remote web-page
Using wget (contd.) • Local file contains exactly the output from our program, i.e. without any HTTP headers
The local file generated by wget contains the output from our program • Of course, we are not interested in that output • We are simply interested in the fact that the file writable/todaysGuardian.html was generated • We are not even interested in hanging around while wget executes our web program • Thus, we use another utility, called nohup, to call wget to run our program nohup wget http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php
Automating all this • Suppose we want to be able to look at the file writable/todaysGuardian.html first thing every morning • Suppose we want to ensure that it is updated each morning before we wake up • We can use the crontab utility to make this happen
crontab • crontab will execute programs automatically at times we specify • To ask it to execute our program, 7 days a week at 3:30 AM, use the following entry, whose first five fields are minute, hour, day of month, month and day of week: 30 3 * * 1,2,3,4,5,6,7 nohup wget http://cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php