100 likes | 229 Vues
This chapter explores various tools for data processing, focusing on regular expressions, shell commands, shell scripts, and Python programs. It outlines how to select the most suitable tools based on your data type, including user input, internet data, and outputs from other programs. The guide delves into techniques for gathering data from online sources and converting formats, emphasizing the importance of appropriate tool selection for efficiency. Gain insights into automating command-line operations and the nuances of interacting with spreadsheets and internet data securely.
E N D
Chapter 14 Selecting and Combining Tools F. Duveau 02/03/12
Yourtoolkit - Regular expressions To search and replace - Shell commands To interactwithyour computer at the command line - Shell scripts To combine / automate command-lineoperations - Python programs For more advancedprocessing How to chose the convenienttools for a particulartask? It depends on the kind and number of input data and the amount and frequency of work to perform
Type of input data - Data from user input raw-input() sys.argv[] - Data from the Internet curl() or wget() commands, urllib2 module - Data fromother programs Output of other programs canbecapturedusing redirection (> or >>) or the pipe operator (|) - Data from hardware Chapter 22 gives background on ho to interactdirectlywith the physical world - Data fromspreadsheets Not suitable to manage large and complexdatasets
Data fromspreadsheets More suited to smallone-offprojects and simple record-keeping The first stepisusually to resave the data in a basic text format (eg CSV) xlrd Python package to accessspreadsheet file contents .xlsxfiles are compressed archives containing XML files
Correction of the exercise - Input data: html files obtainedfrom the MWG website - Output: csv file containingnames, sequences and ordernumbers STEP 1: Create a list of primer sequences and namesfrom 1 file STEP 2: Extendthislist to multiple files (fixednumber of files) STEP 3: Retrieve the filenamesfrom the ordernumbers in manage-order.hmtl Automaticallyloop over all orders STEP 4: Create the list of ordernumbers STEP 5: Write the output file
Collecting data from the Internet The ideais to adapt the script to find the information directlyfromwebpagesinstead of manuallydownloading html files Problem: The websiteissecuredwith HTTPS protocol. https://ecom.mwgdna.com/services/track/manage-orders.tcl Strategy to solvethisproblem(to beimplemented in your Python script): 1) Change the default User Agent of Python to simulate a classical web browser 2) Use a Python module (eg. httplib) thatcansend HTTP orders to the server 3) Send the username and password to the authentication HTTPS address 4) The server shouldsend a responseallowingyour program to accessyour primer data