BMB 6216 – Algorithms for Biology – Class 1
Andy Kudlicki
Office: BSB 547
Phone: 772-2253, 771-1011 (cell)
a.kudlicki@utmb.edu
Welcome!
Imagine doing science without computers. Almost all of it can be done:
• Paper file folders
• Xeroxing
• Photographs on film
• Actually going to the library to browse journals
• Abstract collections
• Telephone, snail mail, telegrams
• Typewriters
The one exception: science is quantitative, and always has been.
This course:
• Using computers for computing.
• Aspects useful in biology / bioinformatics
• Simple tasks (2 * 71.12 = ?)
• Simple repetitive tasks (few or many repetitions)
• Somewhat complicated tasks
• Typical problems of high complexity
• BLAST, genome assembly, motif discovery, ...
• spreadsheets (solved, software available)
Course Overview
Class 1: Introduction to the course and to the Perl programming language
Class 2: Computational complexity and numerical stability of algorithms
Class 3: Data structures and containers in PERL and other languages
    1. Tables, lists, queues, hashes and when to use them
    2. When PERL is not enough: a quick look at R and C++
Class 4: Matrix operations; Principal Component Analysis; ICA
Class 5: Network / graph algorithms
    1. Interaction networks
    2. Regulation networks
    3. Graphs for enumerating hypotheses
Class 6: Strings and regular expressions
    1. In silico enzyme digestion
    2. Gene translation
Class 7: Randomization and Monte Carlo simulations
    1. Randomization by permutation
    2. Modeling the null-hypothesis probability distribution
Class 8: Custom vector graphics: generating SVG from your data
    1. Create and re-create the killer graph for your paper
Class 9: Visualization of multidimensional data
Class 10: Web tools
    1. The components of a web page, elements of HTML
    2. Extracting data from webpages and other documents
    3. Connect to GenBank using BioPerl
Class 11: cgi-bin: creating dynamic web-based tools for data analysis
Class 12: Relational databases and SQL
    1. Relational model, normalization
    2. Basic SQL
    3. Examples: Experimental results,
Class 13: Databases and WWW
Class 14: Clustering
    1. Hierarchical
    2. K-means
    3. Friends-of-friends
Class 15: Timecourses and spectral analysis; convolution
Format: Mixed – lecture with hands-on assignments.
Computer environment: Linux
Languages: Perl; also C/C++, R, shell, awk, sed, ..., when needed
Supplementary reading:
• Larry Wall et al.: Programming Perl
• Wing-Kin Sung: Algorithms in Bioinformatics
• James Tisdall: Beginning Perl for Bioinformatics
• James Tisdall: Mastering Perl for Bioinformatics
• Stroustrup: The C++ Programming Language
Special requests: Welcome!
Computer environment: Linux
• Rich in standard tools, mostly open-source
• Industry standard
• Very similar to MacOS, Android, iOS, BSD, ChromeOS, etc.
• Has many flavors created for specific purposes
Using your laptop in class
To get a *nix environment:
• Linux laptop (or Unix console on Mac)
• Live CD distribution
• cygwin
• Virtual machine
• Remote session (preferred, guaranteed to work)
Remote session:
• From Windows: use "Remote Desktop Connection", server 129.109.88.185
• From Mac: install "Remote Desktop Connection Client for Mac"
• From Linux: rdesktop 129.109.88.185
Also works from off campus (mycitrix.utmb.edu -> remote desktop session)
Other options:
• ssh (PuTTY on Windows); no graphics though, and on-campus only
• NX NoMachine
Login to: 129.109.54.80
Username:
Password:
Unix / Linux shell / command line:
• List files: ls, ls -a, ls -1, ls -l, ls -lrt
• Directory: cd, pwd
• Copy, move, delete, link: cp, mv, rm, ln
• Machine status: ps, w, uptime, top, df, du, whoami, /sbin/ifconfig, date
• Text editors: joe, nano, emacs (C-x C-f), vi
• Pagers: more, less; also: cat, head, tail, tac
• Misc: echo, tr, sed, man, wc, chmod
Simple data flow / spreadsheet-like
• Find in file: grep [grep -v; grep -f; egrep]
• Select top/bottom lines from file: head, tail
• Select columns: awk, e.g. awk '{print $2, $3, $5+$6}'
• Merge lines: cat
• Merge columns: paste
• Sort: sort
• Data flow: > >> < | tee tac
Exercise:
The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W.)
• List the 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25).
Log in to your account (on 129.109.88.185)
• Make a fresh directory, e.g.
    mkdir bmb6216
    cd bmb6216
    mkdir class_1; cd class_1
    cp /data/students/classes/hello.pl .
• Cat it. Less it. Run it.
• Backup: cp hello.pl hello-0.pl
• Edit it: vi hello.pl
Editing with vi
• I / i (insert)
• A / a (append)
• X / x / dd (delete)
• R (replace) / r (replace 1 character)
• {n} W / w / B / b / h j k l: move around
• [ESC]: back from insert to command mode
• ZZ / :w / :q / :wq / :x / :q!: exit / save / quit
• xp: swap chars; ddp: swap lines
Exercise:
The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)
• List the 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25).
PERL
Why PERL?
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
• Versatile, portable
• Widely used in bioinformatics and web applications
• There's more than one way to do it
• Not the most elegant language, great for dirty hacks
• Easily integrated with anything
Warning: PERL6 ain't PERL
PERL HELLO WORLD:
    print "Hello \n";
Typed interactively (end the input with ^D):
    > perl
    print "Hello \n";
    ^D
As a one-liner:
    > perl -e 'print "Hello \n";'
As a script, hello.pl:
    #!/usr/bin/perl
    print "Hello \n";
Run it with
    > perl hello.pl
or
    > ./hello.pl        (after chmod +x hello.pl)
VARIABLES:
Scalar:
    $dna = 'ATTTGCCCTGCCCATT';
    $mouse_tail_inches = 2.13;
    $RNA = "GGGUUCAAUAUAUGGC";
    $seven = -6;
No need to declare variables.
Default variable: $_
If no variable is specified, many built-in functions operate on $_.
VARIABLES:
No need to declare variables. Risky though:
    $my_variable = 51;
    $something = $my_variable + 3;
    $something_else = $myvariable + 4;    # typo: $myvariable is a different, empty variable
The remedy:
    use strict;
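A minimal sketch of what use strict buys you, reusing the variable names from the slide. The string eval is only an illustration device (not part of the original slide): it lets the script show the compile error instead of stopping on it.

```perl
use strict;
use warnings;

# The slide's code, compiled under "use strict".
# The misspelled $myvariable is never declared, so compilation fails.
my $code = <<'END_CODE';
use strict;
my $my_variable    = 51;
my $something      = $my_variable + 3;
my $something_else = $myvariable + 4;   # typo: underscore missing
1;
END_CODE

my $ok    = eval $code;   # string eval compiles and runs $code
my $error = $@;           # holds the compile error, if any

my $message = $ok ? "compiled fine\n" : "strict caught the typo:\n$error";
print $message;
```

Without the use strict line inside $code, the same snippet would run silently and treat $myvariable as empty.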
OPERATIONS:
String:
    $dna = "ATAGAGGTA" . "CATATC";
    $at_repeat = "AT" x 50;
    substr()    sub-string
    length()
    chop        (removes last char)
    chomp       (removes end of line)
Binding:
    print $dna if $dna =~ /ATA/;
Special characters: \t \n
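The string operations listed above can be tried together in a short script. This is only a sketch: the repeat count of 3 and the GATTACA line are made up for the example.

```perl
use strict;
use warnings;

my $dna       = "ATAGAGGTA" . "CATATC";   # concatenation
my $at_repeat = "AT" x 3;                 # repetition

my $len     = length($dna);               # number of characters
my $first3  = substr($dna, 0, 3);         # 3 characters starting at offset 0
my $has_ata = ($dna =~ /ATA/) ? 1 : 0;    # binding operator with a regex

my $line = "GATTACA\n";
chomp($line);                             # removes the trailing newline only

print "$dna has length $len and starts with $first3\n";
print "repeat: $at_repeat\n";
print "contains ATA\n" if $has_ata;
print "line after chomp: '$line'\n";
```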
The different quotations
    $x = 6;
    print "x= $x \n";    # double quotes: $x and \n are interpolated
    print 'x= $x \n';    # single quotes: printed literally
OPERATIONS:
Arithmetic:
    $a + $b     addition
    $a - $b     subtraction
    $a * $b     multiplication
    $a % $b     remainder (modulus)
    $a ** $b    exponentiation
OPERATIONS:
Incrementation (C-like):
    $a++
    $a *= 4
    $repeat = 'AT';
    $repeat x= 36;
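A small sketch of the arithmetic and assignment operators at work. The sequence length of 17 is a made-up value; division (/) and int() are standard Perl, although the slides do not list them.

```perl
use strict;
use warnings;

my $length = 17;                        # hypothetical sequence length

my $full_codons = int($length / 3);     # whole codons that fit
my $leftover    = $length % 3;          # bases left over
my $all_codons  = 4 ** 3;               # number of possible codons

my $count = 0;
$count++;                               # increment by one
$count *= 4;                            # multiply in place

my $repeat = 'AT';
$repeat x= 3;                           # repeat the string in place

print "$full_codons codons, $leftover bases left over\n";
print "$all_codons possible codons\n";
print "count = $count, repeat = $repeat\n";
```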
LISTS/TABLES:
    @a = (4, 6, 3.21, 7, 'cat', "dog");
    $a[0] = 6;
    $#a       index of the last element
    @a + 0    size of the array
OPERATIONS:
• join / split
• push / pop / shift / unshift
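The list operations above, sketched on the slide's own array. The extra elements 'mouse' and 'first' are made up for the demonstration.

```perl
use strict;
use warnings;

my @a = (4, 6, 3.21, 7, 'cat', "dog");
$a[0] = 6;                         # overwrite the first element

my $last_index = $#a;              # index of the last element
my $size       = scalar(@a);       # number of elements, same as @a + 0

push @a, 'mouse';                  # append at the end
my $popped = pop @a;               # remove from the end

unshift @a, 'first';               # insert at the front
my $shifted = shift @a;            # remove from the front

my $joined = join(',', @a);        # array -> string
my @fields = split /,/, $joined;   # string -> array

print "last index $last_index, size $size\n";
print "joined: $joined\n";
print "split back into ", scalar(@fields), " fields\n";
```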
HASHES:
The most important data type in biology!
    $expression{"RPS16"} = 4.65;
    %expression = (
        RPL12 => 1.23,
        CDC28 => 5.31,
        STAT1 => "experiment gone south"
    );
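A short sketch of everyday hash handling, using gene names and values from the slide. The functions keys, exists and delete are standard Perl but are not shown on the slide.

```perl
use strict;
use warnings;

my %expression = (
    RPL12 => 1.23,
    CDC28 => 5.31,
);
$expression{"RPS16"} = 4.65;               # add one more key

my @genes = sort keys %expression;         # keys come in no fixed order, so sort
for my $gene (@genes) {
    printf "%-6s %.2f\n", $gene, $expression{$gene};
}

my $has_stat1 = exists $expression{STAT1} ? 1 : 0;   # test without creating the key
delete $expression{RPL12};                           # remove a key
my $n_genes = scalar(keys %expression);

print "STAT1 measured: ", ($has_stat1 ? "yes" : "no"), "\n";
print "$n_genes genes left\n";
```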
FLOW CONTROL:
    if ( $a > 4 ) { print sqrt($a), "\n"; }
    while ( $x > 0 ) { print --$x, "\n"; }
    $x > 0 or $x = 6;
    for $z (1..333) { print $z, ' '; }
    for ($i = 0; $i <= 1000; ++$i) { next unless $a[$i] > 0; }
TRUE or FALSE
False strings:
• "0"
• ""
Every other string is true!
• "0.00" is true
• "0.00" + 0 is false
• if ( 'Elvis is alive' ) { print 4+5, "\n"; }
• undef is false
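The truth rules above can be checked directly. In this sketch the helper truth() is a made-up name; it simply reports how Perl evaluates a value in a boolean test.

```perl
use strict;
use warnings;

# Report how Perl evaluates a value in boolean context.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

# Each case: a label to print, and the value to test.
my @cases = (
    [ q{"0"},              "0" ],
    [ q{""},               "" ],
    [ q{"0.00"},           "0.00" ],
    [ q{"0.00" + 0},       "0.00" + 0 ],
    [ q{'Elvis is alive'}, 'Elvis is alive' ],
    [ q{undef},            undef ],
);

for my $case (@cases) {
    my ($label, $value) = @$case;
    printf "%-18s is %s\n", $label, truth($value);
}
```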
SUBROUTINES
    sub addit {
        my ($x1, $x2) = @_;
        return $x1 + $x2;
    }
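A sketch of calling the slide's addit and building on it. The mean() subroutine is a hypothetical helper added for illustration; it is not part of the original slide.

```perl
use strict;
use warnings;

# The subroutine from the slide: arguments arrive in @_.
sub addit {
    my ($x1, $x2) = @_;
    return $x1 + $x2;
}

# Hypothetical helper: average of any number of values, built on addit.
sub mean {
    my @values = @_;
    return unless @values;             # nothing to average
    my $sum = 0;
    $sum = addit($sum, $_) for @values;
    return $sum / @values;             # @values in numeric context is its size
}

print addit(2, 3), "\n";
print mean(1, 2, 3, 6), "\n";
```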
Input / Output:
    # <> reads lines from the files named on the command line, or from standard input
    while (<>) { chomp; $sum += $_; }
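The same read-and-sum loop, sketched as a subroutine that works on any filehandle. The name sum_lines and the sample numbers are assumptions made for the example; the in-memory filehandle only stands in for a real file or standard input.

```perl
use strict;
use warnings;

# Add up the numbers found one per line on a filehandle.
sub sum_lines {
    my ($fh) = @_;
    my $sum = 0;
    while (my $line = <$fh>) {
        chomp $line;                # drop the end-of-line character
        next if $line eq '';        # skip empty lines
        $sum += $line;
    }
    return $sum;
}

# Made-up input, read through an in-memory filehandle.
my $text = "1.5\n2.5\n\n4\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my $total = sum_lines($fh);
close $fh;

print "sum = $total\n";
```

Passing \*STDIN instead of $fh would sum numbers typed or piped into the script.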
Input:
    open BLABLA, "data.csv" or die "Cannot open data.csv: $!";
    $firstline = <BLABLA>;
    chomp $firstline;
    @headers = split "\t", $firstline;
    while (<BLABLA>) {
        # do something with the current line, held in $_
    }
    close BLABLA;
Output:
• print $x, "\n";
• printf "format", $x;
• print + join " ", @list;
To a file:
    open BLABLA, ">outdata.csv";
    print BLABLA $x, $y, "\n";    # no comma after the filehandle!
    close BLABLA;
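Reading and writing can be combined into one small pipeline. The sketch below uses lexical filehandles and three-argument open, a common alternative to the bareword BLABLA style. The table contents are made up, and in-memory strings stand in for data.csv and outdata.csv.

```perl
use strict;
use warnings;

# Made-up tab-separated table with a header line.
my $input = "id\tname\tt1\n"
          . "g1\talpha\t1.5\n"
          . "g2\tbeta\t2.5\n";

open(my $in, '<', \$input) or die "Cannot open input: $!";

my $firstline = <$in>;
chomp $firstline;
my @headers = split /\t/, $firstline;      # column names

my $output = '';
open(my $out, '>', \$output) or die "Cannot open output: $!";

while (my $line = <$in>) {
    chomp $line;
    my @fields = split /\t/, $line;
    # Write columns 1 and 3; note: no comma after the filehandle
    print {$out} join(',', $fields[0], $fields[2]), "\n";
}

close $in;
close $out;

print "columns: @headers\n";
print $output;
```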
Exercises:
1. Repeat in PERL the awk/sort exercise from the last hour.
2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of genes. What is the correlation between the position of the GATGAGA sequence and the average expression of the gene?
C / C++ -> for total control
Hello.C:
    #include <iostream>
    using namespace std;

    int main()
    {
        // Print a greeting followed by the sum 5+4
        cout << "Hello :) " << 5 + 4 << endl;
        return 0;
    }