Andy Kudlicki Office: BSB 547 Phone: 772-2253, 771-1011 cell a.kudlicki@utmb.edu


  1. BMB 6216 – Algorithms for Biology - Class 1
     Andy Kudlicki
     Office: BSB 547
     Phone: 772-2253, 771-1011 (cell)
     a.kudlicki@utmb.edu

  2. Welcome!
     • Imagine doing science without computers? It can (almost all) be done:
       • Paper file folders
       • Xeroxing
       • Photographs on film
       • Actually going to the library to browse journals
       • Abstract collections
       • Telephone, Snail-mail, Telegrams
       • Typewriters

  3. The one exception: Science is quantitative, and has always been.

  4. This course:
     • Using computers for computing.
     • Aspects useful in biology / bioinformatics
     • Simple tasks ( 2 * 71.12 = ? )
     • Simple repetitive tasks (few or many repetitions)
     • Somewhat complicated tasks
     • Typical problems of high complexity
       • BLAST, genome assembly, motif discovery, ...

  5. This course:
     • Using computers for computing.
     • Aspects useful in biology / bioinformatics
     • Simple tasks ( 2 * 71.12 = ? )
     • Simple repetitive tasks (few or many repetitions)
     • Somewhat complicated tasks
     • Typical problems of high complexity
       • BLAST, genome assembly, motif discovery, ...
     spreadsheets ( Solved, software available )

  6. Course Overview
     Class 1     Introduction to the course and to the Perl programming language
     Class 2     Computational complexity and numerical stability of algorithms
     Class 3     Data Structures and Containers in PERL and other languages
                 1. Tables, lists, queues, hashes and when to use them
                 2. When PERL is not enough: A quick look at R and C++
     Class 4     Matrix operations; Principal Component Analysis; ICA
     Class 5     Network / graph algorithms
                 1. Interaction Networks
                 2. Regulation networks
                 3. Graphs for enumerating hypotheses

  7. Course Overview
     Class 6     Strings and Regular Expressions
                 1. In silico enzyme digestion
                 2. Gene translation
     Class 7     Randomization and Monte Carlo simulations
                 1. Randomization by permutation
                 2. Modeling the null-hypothesis probability distribution
     Class 8     Custom vector graphics: generating SVG from your data
                 1. Create and re-create the killer graph for your paper
     Class 9     Visualization of multidimensional data
     Class 10    Web tools
                 1. The components of a web page, elements of HTML.
                 2. Extracting data from webpages and other documents.
                 3. Connect to GenBank using BioPerl

  8. Course Overview
     Class 11    Cgi-bin: Creating dynamic web-based tools for data analysis.
     Class 12    Relational databases and SQL
                 1. Relational Model, normalization
                 2. Basic SQL
                 3. Examples: Experimental results,
     Class 13    Databases and WWW
     Class 14    Clustering
                 1. Hierarchical
                 2. K-means
                 3. Friends-of-friends
     Class 15    Timecourses and spectral analysis; Convolution.

  9. Format: Mixed – lecture with hands-on assignments.
     Computer environment: Linux
       Perl, also C/C++, R, shell, awk, sed, ..., when needed
     Supplementary reading:
       Larry Wall et al: Programming Perl
       Wing-Kin Sung: Algorithms in Bioinformatics
       James Tisdall: Beginning Perl for Bioinformatics
       James Tisdall: Mastering Perl for Bioinformatics
       Stroustrup: The C++ Programming Language
     Special requests: Welcome!

  10. Format: Mixed – lecture with hands-on assignments.
      Computer environment: Linux
      • Rich in standard tools, mostly open-source
      • Industry standard
      • Very similar to MacOS, Android, iOS, BSD, ChromeOS, etc.
      • Has many flavors created for specific purposes

  11. Using your laptop in class:
      To get a *nix environment:
      • Linux laptop (or Unix console on Mac)
      • Live CD distribution
      • cygwin
      • virtual machine
      • remote session (preferred, guaranteed to work)

  12. Remote session:
      Use
      • “Remote Desktop Connection” from win*
        • Server: 129.109.88.185
      • From Mac: install “Remote Desktop Connection Client for Mac”
      • From Linux: “rdesktop 129.109.88.185”
      Also works from off campus
      • (mycitrix.utmb.edu -> remote desktop session)
      Other options:
      • ssh (PuTTY on Windows), no graphics though, only on-campus
      • NX NoMachine

  13. Login to: 129.109.54.80
      Username:
      Password:

  14. Unix / Linux shell / command line:
      • List files: ls   ls -a   ls -1   ls -l   ls -lrt
      • Directory: cd   pwd
      • Copy, move, delete, link: cp   mv   rm   ln
      • Machine status: ps   w   uptime   top   df   du   whoami   /sbin/ifconfig   date
      • Text editors: joe   nano   emacs (c-x c-f)   vi
      • Pager: more   less; also: cat, head, tail, tac
      • Misc: echo   tr   sed   man   wc   chmod

  15. Simple data flow / spreadsheet-like
      • Find in file: grep [grep -v; grep -f; egrep]
      • Select top/bottom lines from file: head, tail
      • Select columns: awk
          awk '{print $2, $3, $5+$6}'
      • Merge lines: cat
      • Merge columns: paste
      • Sort
      • Data flow: >   >>   <   |   tee   tac

  16. Exercise:
      The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time-series in columns (affy-id, name, gene-id, data*36).
      • How many named genes are there?
      • What is the average expression at timepoint 1? In how many genes is it above average?
      • What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W.)
      • List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)

  17. Log in to your account (on 129.109.88.185)
      • Make a fresh directory, e.g.
          mkdir bmb6216
          cd bmb6216
          mkdir class_1; cd class_1
          cp /data/students/classes/hello.pl .
      • Cat it.
      • Less it.
      • Run it.
      • Backup: cp hello.pl hello-0.pl
      • Edit it: vi hello.pl

  18. Editing with vi
      • I / i (insert)
      • A / a (append)
      • X / x / dd (delete)
      • R(eplace) / r(eplace 1 character)
      • {n} W / w / B / b / hjkl - move around
      • [ESC] - back from insert to command
      • ZZ / :w / :q / :wq / :x / :q! - exit / save / quit
      • xp - swap chars. ddp - swap lines

  19. Exercise:
      The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time-series in columns (affy-id, name, gene-id, data*36).
      • How many named genes are there?
      • What is the average expression at timepoint 1? In how many genes is it above average?
      • What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)
      • List 200 named genes that have the highest (t7+t19+t31)-(t1+t13+t25)

  20. PERL
      Why PERL?
      Practical Extraction and Report Language
      Pathologically Eclectic Rubbish Lister
      • Versatile, portable
      • Widely used in bioinformatics and web applications
      • There's more than one way to do it
      • Not the most elegant language, great for dirty hacks
      • Easily integrated with anything

  21. Warning: PERL6 ain't PERL

  22. PERL HELLO WORLD:
        print "Hello \n";

  23. PERL HELLO WORLD:
        > perl
        print "Hello \n";
        ^D

  24. PERL HELLO WORLD:
        > perl -e 'print "Hello \n";'

  25. PERL HELLO WORLD:
      hello.pl
      ==================
      #!/usr/bin/perl
      print "Hello \n";
      ==================
        > perl hello.pl
      Or
        > ./hello.pl     (after chmod +x hello.pl)

  26. VARIABLES:
      Scalar:
        $dna = 'ATTTGCCCTGCCCATT';
        $mouse_tail_inches = 2.13;
        $RNA = "GGGUUCAAUAUAUGGC";
        $seven = -6;
      No need to declare variables.
      Default variable: $_
        If no variable is specified, $_ is assumed.
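A minimal sketch of the default variable in action. The list of sequences is invented for illustration; the point is only that `for`, `chomp`, `length` and the pattern match all fall back to `$_` when no variable is named.

```perl
use strict;
use warnings;

# Hypothetical input lines, each still carrying its newline.
my @sequences = ("ATTTGCCCTGCCCATT\n", "GGGUUCAAUAUAUGGC\n");

my @lengths;
my $rna_count = 0;
for (@sequences) {            # no loop variable named, so each element is $_
    chomp;                    # same as chomp($_)
    my $n = length;           # same as length($_)
    push @lengths, $n;
    $rna_count++ if /U/;      # same as $_ =~ /U/
}

print "lengths: @lengths\n";
print "sequences containing U: $rna_count\n";
```

Because `$_` is an alias for the array element inside the loop, `chomp` modifies `@sequences` itself.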

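  27. VARIABLES:
      No need to declare variables. Risky though:
        $my_variable = 51;
        $something = $my_variable + 3;
        $something_else = $myvariable + 4;     # typo: $myvariable is a new, empty variable
      Remedy: put
        use strict;
      at the top of the script, so that undeclared variables are reported as errors.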
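A small sketch of how `use strict` turns the silent typo into an error. The string `eval` is only a device to let this demonstration keep running after the compile error; in an ordinary script the misspelled line would simply stop compilation. Variable names follow the slide.

```perl
use strict;
use warnings;

my $my_variable = 51;
my $something   = $my_variable + 3;

# The misspelled line is compiled separately so that this script survives
# the error and can report it. Under strict, $myvariable was never declared.
my $compiled = eval q{ my $something_else = $myvariable + 4; 1 };

print "something = $something\n";
print "strict rejected the typo:\n$@" unless $compiled;
```

Without `use strict`, the same line would quietly treat `$myvariable` as an empty value and compute 4.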

  28. OPERATIONS:
      String:
        $dna = "ATAGAGGTA" . "CATATC";
        $at_repeat = "AT" x 50;
        substr()     sub-string
        length()
      Binding:
        print $dna if $dna =~ /ATA/;
      chop (last char)     chomp (end of line)
      Special characters: \t \n
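The string operations listed above can be tried together in a short script. This is an illustrative sketch; the sequences are the ones from the slides, and the variable names `$first3`, `$line` and `$word` are made up for the example.

```perl
use strict;
use warnings;

my $dna       = "ATAGAGGTA" . "CATATC";   # concatenation with the dot operator
my $at_repeat = "AT" x 50;                # repetition: "AT" fifty times

my $length = length $dna;                 # number of characters
my $first3 = substr($dna, 0, 3);          # 3 characters starting at offset 0

my $line = "GGGUUCAAUAUAUGGC\n";
chomp $line;                              # removes a trailing newline, if present

my $word = "ATTTG";
chop $word;                               # removes the last character, whatever it is

print "$dna has $length bases and starts with $first3\n";
print "motif ATA found\n" if $dna =~ /ATA/;   # =~ binds the string to a pattern
print join("\t", $line, $word), "\n";         # \t is a tab, \n a newline
```

`chomp` is usually the safer choice when reading lines from a file, since `chop` removes a character even when there is no newline to strip.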

  29. The different quotations
        $x=6;
        print "x= $x \n";
        print 'x= $x \n';
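To make the difference between the two quotations visible, the sketch below stores both strings instead of printing them directly. The names `$double` and `$single` are illustrative only.

```perl
use strict;
use warnings;

my $x = 6;

my $double = "x= $x \n";   # double quotes: $x is interpolated, \n becomes a newline
my $single = 'x= $x \n';   # single quotes: everything is taken literally

print $double;
print $single, "\n";

printf "double-quoted string has %d characters\n", length $double;
printf "single-quoted string has %d characters\n", length $single;
```

The single-quoted version prints the characters `$x` and `\n` exactly as typed, which is why it is the longer of the two strings.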

  30. OPERATIONS:
      Arithmetic:
        $a + $b
        $a - $b
        $a * $b
        $a % $b
        $a ** $b

  31. OPERATIONS:
      Incrementation (C-like)
        $a ++
        $a *= 4
        $repeat = 'AT';
        $repeat x=36;

  32. LISTS/TABLES:
        @a = (4, 6, 3.21, 7, 'cat', "dog");
        $a[0] = 6;
        $#a        index of last element
        @a + 0     size of array
      OPERATIONS:
      • join / split
      • push / pop / shift / unshift
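A short sketch of the list operations named above, using the array from the slide. The added elements ('mouse', 'yeast') and the variable names are invented for the example.

```perl
use strict;
use warnings;

my @a = (4, 6, 3.21, 7, 'cat', "dog");
$a[0] = 6;                      # overwrite the first element

my $last_index = $#a;           # index of the last element
my $size       = @a + 0;        # array in numeric context gives its size

push @a, 'mouse';               # add to the end
my $popped = pop @a;            # remove from the end
my $first  = shift @a;          # remove from the front
unshift @a, 'yeast';            # add to the front

my $csv    = join ',', @a;      # list -> string
my @fields = split /,/, $csv;   # string -> list

print "last index $last_index, size $size\n";
print "popped $popped, shifted $first\n";
print "$csv\n";
```

`push`/`pop` treat the array as a stack, while `push`/`shift` give a queue.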

  34. HASHES:
      The most important data type in biology!
        $expression{"RPS16"} = 4.65;
        %expression = (
          RPL12 => 1.23,
          CDC28 => 5.31,
          STAT1 => "experiment gone south"
        );
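As an illustration of why hashes are so convenient, the sketch below looks genes up by name and ranks them by value. It reuses the numeric entries from the slide and leaves out the non-numeric STAT1 entry so that the numeric sort stays meaningful; the helper variable names are assumptions.

```perl
use strict;
use warnings;

my %expression = (
    RPL12 => 1.23,
    CDC28 => 5.31,
);
$expression{"RPS16"} = 4.65;            # add one more key/value pair

my $n_genes   = keys %expression;       # keys in scalar context: number of entries
my $has_stat1 = exists $expression{STAT1} ? 'yes' : 'no';

# Gene names ordered from highest to lowest expression.
my @ranked = sort { $expression{$b} <=> $expression{$a} } keys %expression;

print "$n_genes genes; STAT1 measured: $has_stat1\n";
for my $gene (@ranked) {
    printf "%-6s %.2f\n", $gene, $expression{$gene};
}
```

Hash keys have no inherent order, so any ordered output needs an explicit `sort` as shown.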

  35. FLOW CONTROL:
        if ( $a > 4 ) { print sqrt ($a), "\n"; };
        while ( $x > 0 ) { print --$x , "\n"};
        $x>0 or $x = 6;
        for $z (1..333) {print $z, ' ';};
        for ($i=0; $i<=1000; ++$i) { next unless $a[$i] > 0 };
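The constructs above can be combined into a small runnable sketch. The data in `@values` are invented, and the loop bodies do a little more than the one-liners on the slide so that the effect of `next unless` and of the `or` shortcut can be checked.

```perl
use strict;
use warnings;

my @values = (3, -1, 0, 8, -5, 2);     # hypothetical measurements

my ($n_positive, $sum_sqrt) = (0, 0);
for my $v (@values) {
    next unless $v > 0;                # skip zero and negative entries
    $n_positive++;
    $sum_sqrt += sqrt $v;
}

my $x = 3;
my @countdown;
while ($x > 0) {
    push @countdown, --$x;             # decrement first, then store
}

$x > 0 or $x = 6;                      # assignment happens only if the test fails

print "$n_positive positive values, sum of square roots $sum_sqrt\n";
print "countdown: @countdown; x is now $x\n";
```

The `condition or action` form reads like the English "do this unless that" and is a common Perl idiom for defaults.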

  36. TRUE or FALSE
      false strings:
      • "0"
      • ""
      Every other string is true!
        "0.00" is true
        "0.00" + 0 is false
      • if ( 'Elvis is alive' ) { print 4+5, "\n"; };
      • undef() is false
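These rules are easy to verify directly. The sketch below uses a hypothetical helper, `truth`, that reports how Perl evaluates a value in a boolean test.

```perl
use strict;
use warnings;

# Returns the word 'true' or 'false' for any scalar, using Perl's own rules.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

my @cases = (
    [ 'the string "0"',        "0"              ],
    [ 'the empty string',      ""               ],
    [ 'the string "0.00"',     "0.00"           ],
    [ '"0.00" + 0 (number 0)', "0.00" + 0       ],
    [ 'any other text',        'Elvis is alive' ],
    [ 'undef',                 undef            ],
);

for my $case (@cases) {
    my ($label, $value) = @$case;
    printf "%-24s is %s\n", $label, truth($value);
}
```

The `"0.00"` case matters in practice: a value read from a file is a string, so add 0 to it before testing whether the number is zero.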

  37. SUBROUTINES
        sub addit {
          my ($x1, $x2) = @_;
          return $x1 + $x2;
        };
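A brief sketch of defining and calling subroutines. `addit` is the subroutine from the slide; `mean` is a hypothetical second example added to show that `@_` can hold any number of arguments.

```perl
use strict;
use warnings;

# Arguments arrive in the special array @_ and are copied into named variables.
sub addit {
    my ($x1, $x2) = @_;
    return $x1 + $x2;
}

# Hypothetical helper: average of however many numbers are passed in.
sub mean {
    my @numbers = @_;
    return undef unless @numbers;      # avoid dividing by zero on an empty list
    my $sum = 0;
    $sum += $_ for @numbers;
    return $sum / @numbers;
}

my $total   = addit(4, 5);
my $average = mean(1, 2, 3, 6);

print "addit(4, 5) = $total\n";
print "mean(1, 2, 3, 6) = $average\n";
```

Declaring the parameters with `my` keeps them local to the subroutine, so they cannot clash with variables of the same name elsewhere in the script.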

  38. Input / Output:
        while (<>) {
          chomp;
          $sum += $_;
        };

  39. Input:
        open BLABLA, "data.csv";
        $firstline = <BLABLA>;
        @headers = split "\t", $firstline;
        while (<BLABLA>) {something};
        close BLABLA;

  40. Output:
      • print $x, "\n";
      • printf "format", $x;
      • print + join " ", @list;
        open BLABLA, ">outdata.csv";
        print BLABLA $x, $y, "\n";     # no comma after the filehandle!
        close BLABLA;
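The input and output slides can be combined into one round trip: write a small table, read it back, and summarize a column. This is a sketch under assumptions: the table contents are invented, a temporary file stands in for data.csv, and it uses lexical filehandles with the three-argument form of `open` plus an error check, a more defensive variant of the bareword `BLABLA` style shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny tab-separated table (made-up values) to a temporary file.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} join("\t", 'name',  't1', 't2'), "\n";   # no comma after the filehandle
print {$out} join("\t", 'RPL12', 1.5,  2.5),  "\n";
print {$out} join("\t", 'CDC28', 3.0,  4.0),  "\n";
close $out or die "cannot close $filename: $!";

# Read it back: header line first, then one record per line.
open my $in, '<', $filename or die "cannot open $filename: $!";
my $firstline = <$in>;
chomp $firstline;
my @headers = split /\t/, $firstline;

my ($sum, $rows) = (0, 0);
while (my $line = <$in>) {
    chomp $line;
    my @fields = split /\t/, $line;
    $sum += $fields[1];                # second column, headed "t1"
    $rows++;
}
close $in;

printf "mean of %s over %d rows: %.2f\n", $headers[1], $rows, $sum / $rows;
```

Checking the return value of `open` is worth the extra few words: a misspelled file name otherwise produces an empty result rather than an error message.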

  41. Exercises:
      1. Repeat in PERL the awk/sort exercise from last hour.
      2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of genes. What is the correlation between the position of the GATGAGA sequence and the average expression of the gene?

  43. C / C++ -> for total control
      ===========================
      Hello.C
      ======
      #include <iostream>
      using namespace std;
      int main () {
        cout << "Hello :) " << 5+4 << endl;
      };
