Unix for Bioinformaticists: Unix Tools, Emacs, and Perl
helpdesk at stat.rice.edu, Aug 2004
Some slides are borrowed from Dr. Worley’s (BCM) presentation.
Do I Have to Know/Use Unix?
• Simple answer: no.
  • Windows can do almost everything.
• Complicated answer: yes, if you
  • are lazy (would like to automate things)
  • are good at reading manuals and writing scripts
  • want to make better use of your machine
  • are as poor as I am (cannot afford pricey Windows software)
  • especially if you will be a bioinformaticist
Why Unix Is Useful in Bioinformatics
• Many tasks involve processing large text-based datasets. Unix tools are in many cases better than their Windows counterparts.
• You may need several tools to accomplish one task. Windows is not particularly good at gluing them together.
• When you need more CPU power, servers and clusters are usually *nix-based.
• Many tools are available only under Unix-like systems.
Outline
• Unix in general
• Unix tools
• Emacs
• Perl
Unix Commands
Single command:
    > sort -k1 file.txt
Combine with other commands:
    > sort -k1 file.txt | grep "Tag=Mouse" > output.txt
Operate on multiple files (csh):
    > foreach file (*.txt)
        sort -k1 $file > ${file:r}_sorted.txt
      end
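The foreach loop above is csh syntax. As a rough sketch of the same idea in a Bourne-style shell (sh or bash), the loop below sorts every .txt file in a scratch directory. The directory, file names, and file contents are made up for illustration; `${file%.txt}` plays the role of csh's `:r`.

```shell
# Scratch directory with two hypothetical input files.
workdir=$(mktemp -d)
printf 'mouse 2\nhuman 1\n' > "$workdir/a.txt"
printf 'yeast 4\nfly 3\n' > "$workdir/b.txt"

# The glob is expanded once, before the loop starts, so the freshly
# written *_sorted.txt files are not picked up during this run.
for file in "$workdir"/*.txt; do
    sort -k1 "$file" > "${file%.txt}_sorted.txt"
done

# Show the first line of one sorted result.
first_line=$(head -n 1 "$workdir/a_sorted.txt")
echo "$first_line"
```

Running the loop a second time would also sort the *_sorted.txt files, so in practice the output usually goes to a different directory or extension.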
More commands
    > rename .html .htm *.html
There are many such convenient tools. If you cannot find one, write a script:
    > foreach f (*.html)
        mv $f ${f:r}.htm
      end
More commands
    > wget -r -l1 --no-parent -A.tar.gz -Ppackages http://cran.r-project.org/src/contrib/PACKAGES.html
Downloads all .tar.gz files into the packages directory. This command can do everything that ‘teleport’ and similar programs do under Windows.
    > convert -rotate 90 file.jpg file.png
Converts a .jpg file to .png format after rotating it 90 degrees.
A shell script: lyx2pdf
    > lyx2pdf myfile.lyx
File: lyx2pdf
    #!/bin/csh
    # strip the extension from the first argument
    set file = $1:r
    lyx --export latex $file.lyx
    latex $file.tex
    dvips -o $file.ps $file.dvi
    ps2pdf $file.ps
A Makefile
%.html: %.tex
	latex2html -local_icons -no_subdir -split 0 $*.tex
%.tex: %.lyx
	lyx2tex $*.lyx
%.dvi: %.tex
	latex $*.tex
%.ps: %.dvi
	dvips -o $*.ps $*.dvi
%.pdf: %.ps
	ps2pdf $*.ps
Usage:
    > make file.dvi
    > make file.ps
    > make file.pdf
A Perl Script
    #!/usr/bin/perl
    # read all the things at once
    undef $/;
    # read in the file and look for /* */
    ($comm) = <> =~ /.*\/\*(.*)\*\//ms;
    # print comments
    print $comm, "\n";
crontab
    # do not forget to renew your library books
    0 0 15 7 * mail bpeng@rice.edu %subject reminder Renew all the books!
    # back up your files to the server every day at 6 AM
    0 6 * * * /usr/local/bin/rsync -avz /home/bpeng thor.stat.rice.edu::backup > logfile
Graphviz
    > dot -Tps try.dot -o try.eps
File: try.dot
    digraph G {
        A->B->C
        B->D->C
    }
Useful (and free) tools
Servers: Apache, openssh, openldap
Web: Mozilla/firefox, Konqueror, lynx
Mail clients: Pine, Mutt, Mozilla/thunderbird, kmail, evolution
Text processing: tetex/lyx, open office, koffice
Languages: gcc, Perl, python, gmake, kdevelop
Scientific libraries and tools: GNU Scientific Library, bioPython, bioPerl, R, Graphviz, gnuplot, octave
Misc: VNC, wget,
Unix text-processing tools
• Access to Unix
  • Mac OSX + developers kit
  • Linux
  • Stat and ruf/owlnet servers (Solaris)
  • Windows + cygwin
• Tools: in contrast to Excel, they are faster and operate on larger files
  • Grep, Pipes, Sort, Comm, Diff, Join
  • Sed: regular-expression substitution editor, replaced by Perl in most contexts
  • Man: lists manual pages with options for most commands (if installed and matching the installed version)
Grep
• Grab lines that match a text phrase
  • Only the line that matches
  • Lines before or after the matched line
  • Lines that do not match
• Piping multiple searches
Grab the Locus, Definition and Keyword lines from phase2.txt.out into temp
Select Non-Human Definition Lines and Use Pipe
    kworley% grep -v Homo temp | grep DEF
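Specify Lines to return
• grep -1: also return one line before and one line after each match
• grep -B1: also return one line before each match
• grep -A1: also return one line after each match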
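A small sketch of the grep slides above, run on a made-up file. The records and accession numbers are hypothetical, and -B is assumed to be available (it is in GNU and BSD grep, but not in every older Unix grep).

```shell
# Hypothetical "temp" file holding Locus, Definition and Keyword lines.
tmp=$(mktemp -d)
cat > "$tmp/temp" <<'EOF'
LOCUS       AC000001
DEFINITION  Homo sapiens clone A
KEYWORDS    HTG
LOCUS       AC000002
DEFINITION  Mus musculus clone B
KEYWORDS    HTG
EOF

# Drop lines mentioning Homo, then keep only the DEFINITION lines.
# The second grep reads from the pipe, so it gets no file argument.
non_human=$(grep -v Homo "$tmp/temp" | grep DEF)

# Return each matching line plus the one line before it.
with_context=$(grep -B1 Mus "$tmp/temp")

echo "$non_human"
echo "$with_context"
```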
Sort
• In dictionary (-d), month (-M), or numerical (-n) order
• Ignore case (-f)
• Specify output file (-o)
• Specify the separator between fields (-t)
• Unique lines only (-u)
• Specify the field on which to sort (-k POS1[,POS2]), with fields numbered starting from 1; can specify which character in the field (field.char)
• Merge more than one sorted file (-m)
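To see why -n matters, here is a minimal sketch that sorts a made-up tab-separated table on its second field, first as text and then as numbers. The file name and gene labels are hypothetical. With -k, fields are counted from 1.

```shell
tmp=$(mktemp -d)
# Hypothetical table: name <TAB> score
printf 'geneC\t7\ngeneA\t120\ngeneB\t15\n' > "$tmp/hits.txt"
tab=$(printf '\t')

# -k2,2 restricts the sort key to field 2 only.
# Text order compares character by character, so "120" < "15" < "7".
as_text=$(LC_ALL=C sort -t "$tab" -k2,2 "$tmp/hits.txt" | cut -f1 | tr '\n' ' ')

# Adding n compares the key as a number: 7 < 15 < 120.
as_number=$(sort -t "$tab" -k2,2n "$tmp/hits.txt" | cut -f1 | tr '\n' ' ')

echo "text order:    $as_text"
echo "numeric order: $as_number"
```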
Comm
• Select or reject lines in common between two sorted files
• Options suppress printing of columns
  • comm [-123] file1 file2
  • Column 1 is lines only in file 1
  • Column 2 is lines only in file 2
  • Column 3 is lines in both files
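A short sketch of the column options, using two made-up gene lists that are already sorted. The file names and contents are hypothetical.

```shell
tmp=$(mktemp -d)
# comm expects both inputs to be sorted already.
printf 'BRCA1\nMYC\nTP53\n' > "$tmp/list1.txt"
printf 'EGFR\nMYC\nTP53\n' > "$tmp/list2.txt"

# -12 suppresses columns 1 and 2, leaving only the lines in both files.
shared=$(LC_ALL=C comm -12 "$tmp/list1.txt" "$tmp/list2.txt" | tr '\n' ' ')

# -23 suppresses columns 2 and 3, leaving the lines found only in file 1.
only_first=$(LC_ALL=C comm -23 "$tmp/list1.txt" "$tmp/list2.txt" | tr '\n' ' ')

echo "in both:        $shared"
echo "only in list1:  $only_first"
```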
Diff
• Compares two files (or sets of files in a directory) and outputs the lines that differ
• Compare as text (-a)
• Ignore changes in white space (-b), blank lines (-B), or case (-i)
• For directory comparisons
  • Report only which files differ, not the details (-q)
  • Compare subdirectories recursively (-r)
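A minimal sketch of the -b option, using two made-up files that differ only in the amount of white space. It relies on diff's exit status: 0 when no differences are found, 1 when there are some.

```shell
tmp=$(mktemp -d)
printf 'ACGT  TTGA\n' > "$tmp/old.txt"   # two spaces between the words
printf 'ACGT TTGA\n' > "$tmp/new.txt"    # one space between the words

# A plain diff sees the extra space as a difference.
if diff "$tmp/old.txt" "$tmp/new.txt" > /dev/null; then
    plain=same
else
    plain=differ
fi

# With -b, changes in the amount of white space are ignored.
if diff -b "$tmp/old.txt" "$tmp/new.txt" > /dev/null; then
    ignoring_space=same
else
    ignoring_space=differ
fi

echo "plain diff: $plain"
echo "diff -b:    $ignoring_space"
```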
Join
• Combines lines from two files based on a common field (-1 field -2 field)
• Specify the fields from each file and the order to output (-o file_number.field file_number.field file_number.field)
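A small sketch of join on two made-up files that share a gene ID in their first field. The file names and values are hypothetical. Both files must already be sorted on the join field.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs, both sorted on field 1 (the gene ID).
printf 'g1 kinase\ng2 phosphatase\ng3 transporter\n' > "$tmp/annot.txt"
printf 'g1 0.9\ng3 0.2\n' > "$tmp/expr.txt"

# Join on field 1 of each file; output the ID, then the value from
# file 2, then the annotation from file 1. g2 has no partner in
# expr.txt, so it is left out.
joined=$(join -1 1 -2 1 -o 1.1,2.2,1.2 "$tmp/annot.txt" "$tmp/expr.txt")
echo "$joined"
```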
What is Emacs?
• A Unix text editor with additional functionality
  • Column functions
  • Settings for DNA mode
  • Settings for programming mode
  • Seamless integration with matlab, R, S-Plus, SAS etc.
Emacs Demonstrations
• Search and replace
  • By query
  • All
  • New lines
• Counting things
• Column functions
  • Select
  • Kill
  • Copy
  • Paste
Query replace
• Esc %
  • Replace phrase
  • With phrase
  • Designate carriage return with control Q control J
• Y or N
• ! to replace all
Rectangle functions
• Mark, select rectangle
• Control x r, followed by
  • r a
    • To copy the rectangle into register a
  • k
    • To kill the rectangle
  • i a
    • To insert the rectangle previously stored in register a
What is Perl?
• A general purpose programming language.
• Invented to replace awk, sed, and sh.
• A scripting language.
• Practical Extraction and Report Language
• Pathologically Eclectic Rubbish Lister
“There is more than one way to do it” (TIMTOWTDI)
How to Use Perl
• Perl “scripts” (programs) are text and are interpreted by the perl program.
• TIMTOWTDI:
  • You can put the script on the command line:
        > perl -e 'print "Hello, world!\n";'
  • You can pass it as an argument to perl:
        > perl my_program.pl
  • You can make the script self-executing:
        > my_program.pl
print, ", ', \n
    'print "Hello, world!\n";'
• In most programming languages, "print" means "display" or "output".
• The single and double quote characters ( " ' ) set apart blocks of "text". In this example, the single quotes set apart the Perl script, and the double quotes set apart the text to display. (Perl has other ways to quote.)
• The backslash, '\', changes the meaning of a character, e.g. to generate special characters. \n means "start a new line" (like pressing Carriage Return, Return, or Enter).
Example of a One Liner (Thanks to Dr. Wheeler)
    perl -nle '@f=split/\t/; print if ($f[2] > 95);' blast_tbl_in.txt > blast_tbl_out.txt
A One Liner: TIMTOWTDI
• perl -nle '@f=split/\t/; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out1.txt
• perl -ne '@f=split; print if ($f[2] > 95 );' blast_tbl_in.txt > blast_tbl_out2.txt
• perl -ane 'print if ($F[2] > 95 );' blast_tbl_in.txt > blast_tbl_out3.txt
split, if, variables
    @f=split/\t/; print if ($f[2] > 95);
• split is a function. It can be written with parentheses as in most languages, and takes UP TO three arguments:
      split( where_to_split, what_to_split, how_many_to_split )
• split, like many Perl functions, uses defaults for missing arguments.
• Special characters mark @whole_arrays, $array_members[1], %whole_hashes, $hash_members{'one'}, and $simple_variables.
• if acts like its common English meaning. It can go before a block or at the end of a statement (as above).
• Perl converts between numbers and text. '>' is a numeric operator, so 95 and $f[2] are treated as numbers. If gt replaced >, they would be treated as strings.
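For comparison only, one more way to do it from the shell: a sketch of the same filter in awk, which is not part of the original slides. The three-row table is made up. Perl arrays count from 0, so `$f[2]` is the third field; awk counts fields from 1, so the same column is `$3`.

```shell
tmp=$(mktemp -d)
# Hypothetical tab-separated table: query, subject, percent identity.
printf 'q1\ts1\t98.50\nq2\ts2\t87.20\nq3\ts3\t95.01\n' > "$tmp/blast_tbl_in.txt"

# Split each line on tabs and print it when the third field,
# compared as a number, is greater than 95.
awk -F '\t' '$3 > 95' "$tmp/blast_tbl_in.txt" > "$tmp/blast_tbl_out.txt"

# List the query names that passed the filter.
kept=$(cut -f1 "$tmp/blast_tbl_out.txt" | tr '\n' ' ')
echo "$kept"
```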
FASTA to XML
    perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% perl -pi.bak -e 's"^>(.*)$"</seq><title>\1</title><seq>";' test.fa
[localhost:~/test] steffen% ls
test.fa test.fa.bak
[localhost:~/test] steffen% more test.fa
</seq><title>CSTAP1E0101A</title><seq>
gttgcctgcgtcttcggxaacaacgtagttctcagGCCGCCCGACCAGGT
ACTTTTTTGCTTTTTTTTTTTTTATTTTTTACAAATTATCAAAAGTTCTT
GTGCTTTCAGGAGCGATTAACATTCTCATGGGCCATACCCTTGTCAGGTT
TCATAAACTAAGTTAGATGGACCTGCTTGGTATTGTGGTGGAAGACCTCC
AAGAAAACAAAGTCCCGGAATCTCAACGTCCTCTGTCTTCTGGCATTTCA
TCTTCAAGAAACAATGTCTTATAGTTATTATTGCATGTTTTGGGAGGTTA
AAGGGTAAAGTTTGTAATGCCTTGACTAAAAACTTCCAGTTGTTATGGTG
cacaacaatttttggtatgctaacttatacttgtgcctaatccttaagga
aaagaaagagccatatacctaaaactgactttatttttcaaaaggta
</seq><title>CSTAP1E0102A</title><seq>
tttttgctggcgaactatcaggagactacagxaactacttttcagtxcga
actcacatcatcactggccgtcgttttacaacgtcgtgattgggaaaacc
ctggcgttacccaacttaatcgccttgcagcacatccccctttcgccagc
tggcgtaatagcgaagaggcccgcaccgatcgcccttcccaacagttgcg
cagcctgaatggcgaatggcgcctgatgcggtattttctccttacgcttt
caatgatgagcacttxtaaaggtctgx
</seq><title>CSTAP1E0103A</title><seq>
atttgagcagcatctattgaaaactaxcgxagxtcttcaggcgcgCCCAC
CCGAGGTACTACCAAGCCAGTGTCCTGCCCGGTTTTAAGCCCTCGTCCTC
TCCCTTCGCTCTCCTCCAAACTGAGCAGCATTAGTTCCACAAGCACAGAA
GTTAAACGAAAAACTGTCTTGCTCCACGGTCTCCTACAGTAGAATGCTGG
ATAATAATGCTTTCAGAAGCCACTTCTACAACCAGAACATTCTGACCACC
ACAATCATCAGGTTTACACACACCCTACGAAACACTAGCGAGTTAACAAG
actgatgaactacttgcagtcgaactccaatcattactggccgtcgtttt
aa
Executing a Perl Script in a File
    $line = <>;
    $line =~ s">(.*)"<title>\1</title><seq>";
    print $line;
    while( $line = <> ) {
        $line =~ s">(.*)"</seq><title>\1</title><seq>";
        print $line;
    }
    print "</seq>\n";
File Reading, Binding, while
    $line = <>;
• <> reads one line from the "current file"
    $line =~ s">(.*)"<title>\1</title><seq>";
• =~ makes the preceding string the "current line" (Binding)
    while( $line = <> ) { print $line; }
• Repeats the statements between { and } while there is another line.
Self-executing Perl Scripts
• You need to know the path to your Perl program:
      > which perl
      /usr/bin/perl
• The first line of your script must be:
      #!/usr/bin/perl
• Permissions need to allow execution:
      > chmod 755 my_program.pl
FASTA to XML Fleshed Out
    #!/usr/bin/perl
    #
    # fasta2xml by David Steffen 6/2/2004
    # - Converts fasta file to mini-xml format
    $inpfile = shift( @ARGV );
    if( not( $inpfile =~ m/^(.*)\.fa$/ ) ) {
        die( "Input file, $inpfile, must be a fasta file and end in .fa\n" );
    }
    $basefile = $1;
    open( INPFILE, $inpfile ) or die( "Can't open $inpfile: $!\n" );
    $outfile = '>' . $basefile . '.xml';
    open( OUTFILE, $outfile ) or die( "Can't open $outfile: $!\n" );
    $line = <INPFILE>;
    $line =~ s">(.*)"<title>\1</title><seq>";
    print OUTFILE $line;
    while( $line = <INPFILE> ) {
        $line =~ s">(.*)"</seq><title>\1</title><seq>";
        print OUTFILE $line;
    }
    print OUTFILE "</seq>\n";
Running Other Programs from Perl
    $files = `ls`;
The "backtick" (` `) characters execute the text between them as a command to the operating system, returning the output of that command (e.g. to the $files variable).
    $error = system( "mv $file ${basefile}.abi" );
The system function executes its argument as a command to the operating system, returning the command's exit status (0 on success, nonzero on failure), not its error messages. (Output is printed as usual.)
There are other, subtle differences between ` ` and system.
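The shell makes the same distinction between a command's output and its success or failure, which may help in remembering which Perl construct gives which. A rough sketch with made-up file names: command substitution captures what a command prints, much as backticks do in Perl, while the exit status tested by `if` corresponds to the value that is zero when Perl's system succeeds.

```shell
tmp=$(mktemp -d)
touch "$tmp/read1.seq"    # hypothetical file to rename

# Capture the command's OUTPUT (compare: $files = `ls`; in Perl).
files=$(ls "$tmp")

# Test the command's EXIT STATUS (compare: system("mv ...") in Perl).
if mv "$tmp/read1.seq" "$tmp/read1.abi"; then
    first_mv=ok
else
    first_mv=failed
fi

# The same mv now fails because the source file is gone; its error
# message goes to stderr, which is discarded here.
if mv "$tmp/read1.seq" "$tmp/read1.abi" 2> /dev/null; then
    second_mv=ok
else
    second_mv=failed
fi

echo "ls printed: $files"
echo "first mv: $first_mv, second mv: $second_mv"
```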