
Grid Usecase BioMed

How to get biologists to compute: Grid Usecase BioMed. Surfnet / Grid Tutorial. Jan Bot. (Name of organisational unit.) Who am I: Graduated March 2008; Bioinformatics group, TU Delft; BioAssist programmer; happy grid user; working on the grid as part of the TU Delft – NKI collaboration.


Presentation Transcript


  1. How to get biologists to compute: Grid Usecase BioMed
     Surfnet / Grid Tutorial
     Jan Bot
     (Name of organisational unit)

  2. Who am I
     • Graduated March 2008
     • Bioinformatics group, TU Delft
     • BioAssist programmer
     • Happy grid user
     • Working on the grid as part of the TU Delft – NKI collaboration:
       • Chris Klijn: human copy number variation
       • Jeroen de Ridder: viral insertions in mice

  3. DNA & Genes

  4. Copy number variation & Viral insertions
     • Pieces of DNA can be added, deleted, moved & removed
     • Viruses can insert themselves into a genome
     • This causes all kinds of problems, for example cancer:
       • Multiple mutations are needed before a tumor starts to develop

  5. aCGH data
     • Array comparative genomic hybridization
     • Compare the DNA of a sample against a reference

  6. KCSmart: Datasets
     • Leukaemia & lymphoma cell lines
     • aCGH data (10k Affy) from the Sanger Institute
     • Same samples measured on 1.8M SNP6
     • 105 cell-line samples
     • About 350 MB of data

  7. KCSmart: Overview
     For each tumor we construct a pairwise space by comparing each chromosome arm with each other chromosome arm. A point in this space is a pair of genomic loci.

  8. KCSmart: Compute Co-occurrence Score
     Using a 2D Gaussian kernel, we look for local enrichment of high scores in the pairwise space. Peaks in the convolved space allow us to define two genomic loci that can be said to be co-aberrated to a certain degree.
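The smoothing step on this slide can be illustrated with a minimal sketch. This is not the KCSmart implementation: it is a pure-Python toy that convolves a small score matrix with a separable Gaussian kernel and reports the highest peak. The function names, the zero padding at the borders, the toy matrix, and the choice to express sigma in grid cells are all assumptions made for illustration.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Return a normalised 1D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_2d(scores, sigma):
    """Convolve a 2D score matrix with an isotropic Gaussian (zero padding)."""
    radius = max(1, int(round(3 * sigma)))  # kernel spans roughly 6 sigma in total
    k = gaussian_kernel_1d(sigma, radius)
    rows, cols = len(scores), len(scores[0])
    # A 2D Gaussian is separable: smooth along rows first, then along columns.
    tmp = [[sum(k[d + radius] * scores[r][c + d]
                for d in range(-radius, radius + 1) if 0 <= c + d < cols)
            for c in range(cols)] for r in range(rows)]
    return [[sum(k[d + radius] * tmp[r + d][c]
                 for d in range(-radius, radius + 1) if 0 <= r + d < rows)
             for c in range(cols)] for r in range(rows)]

def highest_peak(smoothed):
    """Return the (row, col) index of the maximum of the smoothed matrix."""
    cells = ((r, c) for r in range(len(smoothed)) for c in range(len(smoothed[0])))
    return max(cells, key=lambda rc: smoothed[rc[0]][rc[1]])

# Toy pairwise space (hypothetical data): one isolated high score and one
# cluster of moderate scores. Smoothing should favour the cluster.
scores = [[0.0] * 9 for _ in range(9)]
scores[1][7] = 1.5                      # isolated point
for r, c in [(6, 2), (6, 3), (5, 2), (7, 2)]:
    scores[r][c] = 1.0                  # locally enriched region
peak = highest_peak(smooth_2d(scores, sigma=1.0))
print(peak)
```

The isolated score is the largest raw value, but after smoothing the peak lands inside the cluster, which is the "local enrichment" idea the slide describes.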

  9. KCSmart: Parameters (1)
     • Chromosome arms:
       • Natural split at the centromere to better divide the workload
       • Not all p-arms contain measurements (39 out of 44)
     • Resolution:
       • 'Grid points' are fixed on the genome
       • Location of the grid points, and thus the computational complexity, doesn't change when using different datasets
       • Measurements are allocated to grid points
       • Tried this for [20, 25, 35, 50] kbp
       • Choice based on the best resolution which still fits in memory
     [Figure labels: 10k data, Grid, 1.8M data]
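Allocating measurements to fixed grid points could look roughly like the sketch below. The slide only says that measurements are "allocated"; assigning each probe to its nearest grid point is an assumption, as are the function names and the probe values.

```python
from collections import defaultdict

def allocate_to_grid(measurements, resolution_bp):
    """Group (position_bp, value) pairs by their nearest fixed grid point."""
    grid = defaultdict(list)
    for position_bp, value in measurements:
        index = int(position_bp / resolution_bp + 0.5)  # nearest grid index
        grid[index * resolution_bp].append(value)
    return dict(grid)

def n_grid_points(arm_length_bp, resolution_bp):
    """Number of grid points on an arm; depends on the genome, not on the dataset."""
    return arm_length_bp // resolution_bp + 1

# Hypothetical probes: (genomic position in bp, measured value).
probes = [(5_000, 0.1), (19_000, 0.4), (21_000, -0.2), (61_000, 0.9)]
grid_20k = allocate_to_grid(probes, 20_000)
print(grid_20k)
```

Because the grid depends only on arm length and resolution, a sparse 10k dataset and a dense 1.8M dataset produce matrices of the same size, which is why the slide notes that the computational complexity does not change between datasets.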

  10. KCSmart: Parameters (2)
      • Scale:
        • The kernel width in base pairs
        • Capture changes on different scales:
        • [0.2, 2, 10, 20] Mbp (6 sigma)
      • Amplification type:
        • Either insertion or deletion
        • All possible combinations for two chromosomes:
        • [ins:ins, del:del, ins:del, del:ins]
        • (ins = amplification, del = loss)

  11. KCSmart: Getting the Parameters Right
      • 10k data used to estimate memory consumption and running times
      • Find the best resolution & scale that still fit in 2.3 GB of memory
      • Final parameters:
          chr = [1.0, 1.5, ..., 22.5]
          res = [20000]
          scale = [0.2, 2, 10, 20]
          amp = ['ins-ins', 'del-del', 'ins-del', 'del-ins']
      • Roughly 10k jobs (without the jobs required for finding the correct parameter settings!)
      • All parameters generated using a Python script
      • In a JDL it looks like:
          Parameters={"19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"};
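The original Python script is not shown on the slides, so the following is only a sketch of how such a parameter list might be generated. The field order inside each parameter string (arm, arm, scale, amplification index, resolution) is a guess based on the two example strings, and the function names are hypothetical.

```python
from itertools import combinations

def chromosome_arms(n_chromosomes=22):
    """Arms encoded as on the slide: 1.0 = p-arm of chr 1, 1.5 = q-arm, ..., 22.5."""
    return [c + half for c in range(1, n_chromosomes + 1) for half in (0.0, 0.5)]

def parameter_strings(arms, scales, amp_types, resolutions):
    """One string per job: 'armA armB scale ampIndex resolution' (assumed order)."""
    out = []
    for arm_a, arm_b in combinations(arms, 2):
        for res in resolutions:
            for scale in scales:
                for amp_index, _name in enumerate(amp_types, start=1):
                    out.append(f"{arm_a} {arm_b} {scale} {amp_index} {res}")
    return out

def jdl_parameters(strings):
    """Format the strings as a JDL 'Parameters' attribute."""
    quoted = ", ".join(f'"{s}"' for s in strings)
    return f"Parameters={{{quoted}}};"

params = parameter_strings(
    chromosome_arms(),
    scales=[0.2, 2, 10, 20],
    amp_types=["ins-ins", "del-del", "ins-del", "del-ins"],
    resolutions=[20000],
)
print(len(params))
print(jdl_parameters(params[:2]))
```

With all 44 arms this sketch produces more combinations than the "roughly 10k jobs" on the slide. Presumably arms without measurements were left out, which is not modelled here.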

  12. KCSmart: Output
      • +/- 10k files
      • 7.5 GB of 'peak-info'
      • 1 TB of raw data
      • Problems with the grid:
        • once you have all the scripts in place to run jobs, it's easy to create more output than a biologist can analyze
        • once the biologist has some results, he'll ask you to do it again (and again...)

  13. KCSmart: Results 10k data

  14. KCSmart: Results 1.8M data

  15. KCSmart: Results 1.8M data
      Found a known deletion pair (T-cell receptor): the method works.

  16. KCSmart: Future work
      • Higher resolution (once we have 64-bit WNs)
      • Smaller scale
      • Mutual exclusiveness tests
      • Run on a real tumor dataset

  17. Matlab jobs
      • Compile code using Matlab (on a UI), run using MCR
      • Add ctf & executable to the input sandbox:
          InputSandbox={"kcsmart_topos.sh","kcsmart_large.bin","kcsmart_large_run.ctf","curl.gz"};
      • Add the requirement to the JDL:
          Requirements = Member("lsgmcr-7.5",other.GlueHostApplicationSoftwareRunTimeEnvironment);
      • Load module on the WN:
          module load mcr
      • Call executable

  18. Job status tracking problems
      • How do you check which jobs failed?
      • Use output files as indicators:
          lcg-ls lfn:///grid/lsgrid/jbot/chris_large/output/ > output.txt
          cat output.txt | ~/code/chris/check_missing.pl > to_do.txt
      • Copy subset of parameters to the JDL file
      • Submit the job again
      • This takes too long!
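The check_missing.pl script itself is not shown. As a rough sketch of the same idea, the snippet below compares a listing of output files against the full parameter list and returns the parameters that still need to be run. The output naming scheme (spaces replaced by underscores, ".out" suffix) and the function names are purely hypothetical.

```python
def expected_output_name(parameter_string):
    """Hypothetical naming scheme: 'a b c' -> 'a_b_c.out'."""
    return parameter_string.replace(" ", "_") + ".out"

def missing_parameters(all_parameters, listing_text):
    """Return the parameter strings whose output file is absent from the listing."""
    present = {
        line.strip().rsplit("/", 1)[-1]        # keep only the file name
        for line in listing_text.splitlines()
        if line.strip()
    }
    return [p for p in all_parameters if expected_output_name(p) not in present]

all_parameters = ["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000", "1.0 1.5 2 1 20000"]
# Stand-in for the contents of output.txt (one file per line).
listing = """
/grid/lsgrid/jbot/chris_large/output/19.5_15.5_2_1_20000.out
/grid/lsgrid/jbot/chris_large/output/1.0_1.5_2_1_20000.out
"""
to_do = missing_parameters(all_parameters, listing)
print(to_do)
```

The resulting list can be pasted back into the Parameters attribute of a new JDL file for resubmission.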

  19. The Annoyances: glite-wms-job-*
      • glite-wms-job-status
        • It barely tells me anything (unless I specified error codes myself)
        • I would rather know:
          • the number of failed / running jobs
          • the error output
          • or the parameters with which this job was run
      • Use with grep & awk:
          glite-wms-job-status `job-ids` > status.txt
          cat status.txt | gawk '{prev=$7;getline;if($0~/Exit\ Code/){print prev;}}'
      • Output:
          https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
      • Sample status entry:
          Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
          Current Status: Done (Exit Code !=0)
          Exit code: 1
          Status Reason: Warning: job exit code != 0
          Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
          Submitted: Sun Sep 7 21:24:56 2008 CEST
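The wish for "the number of failed / running jobs" can be met by parsing the saved status text. The sketch below assumes the entry layout shown on this slide; the second job in the sample text is invented for illustration, and the function name is hypothetical.

```python
import re
from collections import Counter

JOB_LINE = re.compile(r"^Status info for the Job\s*:\s*(\S+)")
STATUS_LINE = re.compile(r"^Current Status:\s*(.+?)\s*$")

def parse_status(text):
    """Return a list of (job_id, status) pairs found in the status text."""
    jobs, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        match = JOB_LINE.match(line)
        if match:
            current = match.group(1)
            continue
        match = STATUS_LINE.match(line)
        if match and current is not None:
            jobs.append((current, match.group(1)))
            current = None
    return jobs

# First entry copied from the slide; the second job is made up for the example.
status_text = """
Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0

Status info for the Job : https://wms.example.org:9000/hypothetical-job-2
Current Status: Running
"""
jobs = parse_status(status_text)
summary = Counter(status for _job, status in jobs)
failed = [job for job, status in jobs if status.startswith("Done (Exit Code")]
print(summary)
print(failed)
```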

  20. The Annoyances: glite-wms-job-*
      • glite-wms-job-cancel
        • Does not recursively cancel jobs stored in a file
      • Fix:
          glite-wms-job-status -i jobs.txt | grep 'http' | gawk '{print $7}' > to_cancel.txt
          glite-wms-job-cancel -i to_cancel.txt
      • Sample status entry:
          Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
          Current Status: Done (Exit Code !=0)
          Exit code: 1
          Status Reason: Warning: job exit code != 0
          Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
          Submitted: Sun Sep 7 21:24:56 2008 CEST

  21. The Annoyances: lcg-*
      • lcg-cr
        • Getting files to and from the SEs: what, lcg-cr doesn't always work?
        • On error: try again
        • No error: good to go, right?
        • Try copying the file back to the WN
      • lcg-cp
        • Copying > 3000 files from a SE to the UI machine takes > 1 hour
        • Copying the same files over ssh (scp) to my (remote) machine: ~2 minutes
        • Security overhead?
      • Work-arounds:
        • lcg-rec-cp: slow
        • custom script (do it in parallel): nasty
        • Both: don't work when the MCR is loaded
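The "try again, then copy the file back to check" pattern can be sketched as a small retry loop. The upload and verify steps are passed in as plain callables, so no lcg-* command-line options are assumed here. In practice they would wrap the actual lcg-cr and lcg-cp calls. The function name and the FlakyStorage stand-in are hypothetical.

```python
import time

def upload_with_verification(upload, verify, attempts=3, delay_s=0.0):
    """Call upload() until verify() confirms the copy; return the attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            upload()
        except Exception:
            pass                      # explicit error: simply try again
        else:
            if verify():              # no error is not proof: check the copy
                return attempt
        time.sleep(delay_s)
    raise RuntimeError(f"upload not verified after {attempts} attempts")

class FlakyStorage:
    """Stand-in for a storage element that sometimes fails silently."""
    def __init__(self, silent_failures):
        self.silent_failures = silent_failures
        self.stored = False

    def upload(self):
        if self.silent_failures > 0:
            self.silent_failures -= 1  # reports success but stores nothing
        else:
            self.stored = True

    def verify(self):
        return self.stored

storage = FlakyStorage(silent_failures=2)
attempts_used = upload_with_verification(storage.upload, storage.verify, attempts=5)
print(attempts_used)
```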

  22. ToPoS
      • Main developer: Pieter van Beek
      • WebDav + Tokens + pilot job
      • Instead of submitting one job at a time, claim a (bunch of) computer(s) until all jobs are done

  23. ToPoS Overview
      [Diagram: User, ToPoS Server, The Grid. Labeled steps: (1) Job tokens, (2) Pilot Jobs, (3) Job Request, (4) Job Token, (5) Job Output, (6) All Output]

  24. Token renewal
      [Flowchart of a pilot job. Labels: Submit, Pilot job, Running pilot job, Get unused token, Pilot job with token, Execute token task, Finished? (yes / no), Delete token, affirm token use]
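One reading of this flowchart is a loop in which a pilot job keeps claiming unused tokens until none are left. The sketch below simulates that loop against an in-memory token pool. It does not use the real ToPoS interface, and the class, method, and function names are hypothetical. Token renewal is reduced to a lock that stays in place while a task runs.

```python
class TokenPool:
    """In-memory stand-in for a token server (not the real ToPoS API)."""
    def __init__(self, tasks):
        self._tokens = dict(enumerate(tasks))
        self._locked = set()

    def next_token(self):
        """Claim and return (token_id, task) for an unused token, or None."""
        for token_id, task in self._tokens.items():
            if token_id not in self._locked:
                self._locked.add(token_id)
                return token_id, task
        return None

    def delete(self, token_id):
        """Remove a token once its task has finished successfully."""
        self._tokens.pop(token_id, None)
        self._locked.discard(token_id)

    def remaining(self):
        return len(self._tokens)

def run_pilot(pool, execute):
    """Process tokens until none are unclaimed; failed tokens stay in the pool."""
    results = {}
    while True:
        claimed = pool.next_token()
        if claimed is None:           # "Finished?" -> yes
            return results
        token_id, task = claimed
        try:
            results[token_id] = execute(task)
        except Exception:
            continue                  # keep the token so the task can be re-run
        pool.delete(token_id)

def execute(task):
    if task == "bad":
        raise ValueError("simulated job failure")
    return f"done: {task}"

pool = TokenPool(["1.0 1.5", "bad", "2.0 2.5"])
results = run_pilot(pool, execute)
print(results, pool.remaining())
```

Because a token is deleted only after its task succeeds, whatever remains in the pool after a run is exactly the set of jobs to re-run.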

  25. ToPoS: Conclusion
      • Advantages:
        • Easy output handling using Curl with atomic operations
        • Handles failed jobs
        • Less overhead
        • Able to dynamically add or remove nodes
        • Easy to re-run jobs
        • Easy access to output
      • Disadvantages:
        • Little / no security
        • Some overhead at the end of a run (unless you're reserving tokens)
      • Feature request: progress bar

  26. Fixing the difficulties: LEARN BASH!
      • diff is your friend:
        • Useful to transfer missing files to and from the SE
      • grep
        • Useful for querying the status of jobs (use with the -c option)
      • (g)awk
        • Handy to cancel jobs
      • Redirect output to file and push processes to the background:
        • lcg-ls is a typical example

  27. Why not let the biologist do it?
      • Resources needed to get this working on the grid:
        • +/- 180 replies from grid support
        • +/- 100 messages exchanged with the biologists
        • Many hours of work, mostly finding out about the 'quirks' of the software
      • Advantages of making a programmer submit the jobs:
        • One person to handle support
        • Re-use experience with other projects

  28. Some other tricks
      • Nikhef does not 'advertise' the installed software
      • Do your own load balancing (once the job is in a queue, it doesn't get re-scheduled)
        • Easy to do with the cancel script shown previously
      • Don't keep your stuff in $HOME when on WNs; change directory to $TMPDIR at the beginning of your script
      • Keep in mind: once you have retrieved your job output, it's gone from the grid
      • Use startGridSession
      • When using ToPoS: make sure you land in the 'long' queue

  29. Thanks!
      • Sara Grid Support: Jeroen Engelberts, Pieter van Beek, Machiel Jansen
      • NikHef: Jan Just Keijser
      • Collaborators: Chris Klijn, Jeroen de Ridder
