This document outlines a method for running large BLASTX searches on de novo transcriptome (RNA-seq) data, first on the RCAC clusters and then on Condor. Each ~100 MB input file, which can yield up to ~10M hits, is broken into small pieces (~200 KB for the RCAC cluster method, ~40 KB for the Condor method) that are scheduled as many independent jobs to make good use of available CPUs. The use cases are judged on accuracy, reliability, and speed, and the document covers job failures, automatic retries, and manual re-runs. The procedure is repeated for each sample (experiment).
Rick Westerman Purdue Genomics westerman@purdue.edu
blastx
• Nucleotide to protein database
• De novo transcriptome / RNA-seq
• 30K – 150K sequences
• 300 – 5000 bases
• ~100 MB input file
• E-value is 10^-6
• Up to 10M hits to 'nr'
• ~5000 CPU-hours
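As a rough illustration of the search described on this slide, the sketch below builds a blastx command line for one input file. It assumes the NCBI BLAST+ command-line flags (`-query`, `-db`, `-evalue`, `-num_threads`, `-out`); the slides do not say which BLAST release or options were actually used, and the function name, file names, and thread count default are hypothetical.

```python
def build_blastx_command(query_path, out_path, db="nr", evalue=1e-6, threads=8):
    """Return an argument list for one blastx run.

    Assumption: BLAST+ style flags. The slides only state the program
    (blastx), the database ('nr'), and the E-value (10^-6).
    """
    return [
        "blastx",
        "-query", query_path,          # nucleotide sequences (FASTA)
        "-db", db,                     # protein database to search
        "-evalue", str(evalue),        # E-value cutoff from the slide
        "-num_threads", str(threads),  # hypothetical default
        "-out", out_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_blastx_command("chunk_00001.fa", "chunk_00001.out")))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems when file names contain spaces.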
Current method – RCAC clusters
1) Break up input into many ~200 KB files – about 500 of them.
2) Grab up to 250 8-cpu 'standby' nodes on RCAC clusters; 4 hour maximum.
   Note: use own queuing method (“chaining”)
3) Failures are manually caught and re-done.
4) Do above for each sample (experiment)
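Step 1 above can be sketched as follows. This is a minimal example, not the author's script: it assumes the input is a FASTA file and that a chunk may only end at a record boundary, so chunks come out at roughly the target size rather than exactly. The function name and the chunk file naming are hypothetical.

```python
import os


def split_fasta(input_path, out_dir, target_bytes=200_000):
    """Split a FASTA file into chunks of roughly target_bytes each.

    A new chunk is started only at a '>' header line, so no sequence
    record is ever cut in half. Returns the list of chunk paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    written = 0
    with open(input_path, "r") as src:
        for line in src:
            # Open a new chunk at a header once the current one is full
            # (or when no chunk has been opened yet).
            if line.startswith(">") and (out is None or written >= target_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"chunk_{len(chunk_paths) + 1:05d}.fa")
                out = open(path, "w")
                chunk_paths.append(path)
                written = 0
            if out is None:
                continue  # ignore any text before the first header
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths
```

With a ~100 MB input and a 200 KB target this gives on the order of 100 MB / 200 KB = 500 chunks, which matches the "about 500" figure in step 1.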
Condor method
1) Break up input into many ~40 KB files.
2) Toss all files onto Condor. BLAST is set up to use 8 CPUs. Only current restriction: 1 GB memory.
3) Condor retries up to 5 times. After that, failures are manually caught and re-done.
4) Do above for each sample (experiment)
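The retry policy in step 3 can be illustrated with a small sketch. It does not use Condor at all: each job is modeled as a plain Python callable that returns an exit code, and "retries up to 5 times" is read as one initial attempt plus at most five retries, which is an assumption. All names here are hypothetical.

```python
def run_with_retries(job, max_retries=5):
    """Run job() until it returns 0 or the retries are used up.

    Returns (succeeded, attempts_used). Assumption: one initial attempt
    plus up to max_retries further attempts.
    """
    max_attempts = 1 + max_retries
    for attempt in range(1, max_attempts + 1):
        if job() == 0:
            return True, attempt
    return False, max_attempts


def partition_jobs(jobs, max_retries=5):
    """Split named jobs into finished ones and ones left for manual redo."""
    done, manual = [], []
    for name, job in jobs.items():
        succeeded, _ = run_with_retries(job, max_retries)
        (done if succeeded else manual).append(name)
    return done, manual
```

Anything that ends up in the second list corresponds to the failures that are "manually caught and re-done" in the slide.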
Use cases
• Accuracy
• Reliability
• Speed
  – plant
  – insect
Case #6 failure reasons
1650 jobs ...which started up 11,500+ times ...
5919  Abnormal termination (signal 1)
3667  Normal termination (return value 129)
2034  Job was evicted.
  85  Abnormal termination (signal 9)
  74  Normal termination (return value 0)
   1  Normal termination (return value 1)
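The counts above can be totaled and turned into shares with a few lines of Python. The numbers are copied from the slide; the function name and the idea of reporting each outcome as a percentage of all recorded terminations are additions for illustration only.

```python
# Termination counts as listed on the slide.
TERMINATION_COUNTS = {
    "Abnormal termination (signal 1)": 5919,
    "Normal termination (return value 129)": 3667,
    "Job was evicted": 2034,
    "Abnormal termination (signal 9)": 85,
    "Normal termination (return value 0)": 74,
    "Normal termination (return value 1)": 1,
}
JOB_COUNT = 1650


def summarize(counts):
    """Return (total, rows); rows are (reason, count, percent), largest first."""
    total = sum(counts.values())
    rows = [
        (reason, n, 100.0 * n / total)
        for reason, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    ]
    return total, rows


if __name__ == "__main__":
    total, rows = summarize(TERMINATION_COUNTS)
    print(f"{total} recorded terminations for {JOB_COUNT} jobs")
    for reason, n, pct in rows:
        print(f"{n:6d}  {pct:5.1f}%  {reason}")
```

The listed outcomes sum to 11,780, consistent with the "11,500+" start-ups quoted on the slide, or roughly seven starts per job on average.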