A.R.M.S. Active Resource Management Services For Big Data Processing. Revised Presentation One
Outline • Title • Outline • Members • Mentor • Societal Issue • History • Dr. Li • Cluster Computing • Case Study • Accuracy • Current Major Functional Component Diagram • Current Process Flow • Problem Statement • Proposed Major Functional Component Diagram • Proposed Process Flow • Dinosolve Walkthrough • Dinosolve Issues • Software • Hardware • Solution Statement • Competition Identified • 508 Compliance • Objectives • Benefits of Solution • Conclusion • References • Appendix
Group Members and Roles • Scott Pardue (Team Leader) • Michael Rajs (Risk Manager) • Adam Willis (Algorithm Specialist) • Sybil Acotanza (Documentation Specialist) • Jordan Heinrichs (Database Designer) • David Crook (User Interface Designer)
Dr. Yaohang Li • Associate Professor in the Department of Computer Science at Old Dominion University. • Research interests include: • Computational Biology: applies computational simulation techniques to solve biological problems • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem
How do researchers handle the massive amounts of data they are collecting in order to benefit their research?
“Every day, [mankind] create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.”1 http://www-01.ibm.com/software/data/bigdata/
Data Management Examples • Large Hadron Collider [2] • 150 million sensors report 40 million times per second • Facebook [3], per day • 2.5 billion content items shared • 2.7 billion “Likes” • 300 million photos uploaded • Walmart [2] • 1 million customer transactions per hour • 2.5 x 10^15 bytes (2.5 petabytes) of data http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
Dr. Li’s Research • Ideally, his research can be used to develop new protein-modeling programs. Computational approaches can be more efficient and less expensive than experiments carried out by biologists, chemists and others in lab settings • This could lead to the manufacturing of additional drugs to fight conditions as varied as Alzheimer’s disease, cystic fibrosis and mad cow disease http://diverseeducation.com/article/13348/
Dr. Li’s Grants • Dinosolve, his current project, is funded by a five-year, $400,000 CAREER Award from the National Science Foundation • Dr. Li has been the principal or co-principal investigator on research grants totaling more than $15.3 million
Big Data Analysis Hardware • Cluster Computing [4] • A cluster consists of many nodes (computers). • Big data can be generated and analyzed more quickly by spreading the workload among the nodes. • Head node • Logging data • Job submission • 3 computation nodes • 2 processors each • 4 execution slots per processor • 24 total execution slots • The head node packages data from the computation nodes and presents it in a readable format so that it is usable by the research community
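The slot count on this slide follows from simple multiplication. A minimal sketch of that arithmetic, using only the figures given above (the function and constant names are illustrative, not part of the Dinosolve cluster configuration):

```python
# Cluster layout as stated on the slide.
COMPUTE_NODES = 3
PROCESSORS_PER_NODE = 2
SLOTS_PER_PROCESSOR = 4


def total_execution_slots(nodes: int, processors_per_node: int, slots_per_processor: int) -> int:
    """Return how many jobs the cluster can execute at the same time."""
    return nodes * processors_per_node * slots_per_processor


total_slots = total_execution_slots(COMPUTE_NODES, PROCESSORS_PER_NODE, SLOTS_PER_PROCESSOR)
print(total_slots)  # 3 * 2 * 4 = 24
```

Any request beyond these 24 concurrent slots has to wait or be rejected, which is why slot management matters as usage grows.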
Managing the Cluster Distributed Resource Management Systems (D-RMS) • Job management subsystem • Physical resource management subsystem • Scheduling and queuing subsystem
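To make the three subsystems concrete, here is a toy sketch of how they could interact. It is an illustration only, not how Sun Grid Engine or any real D-RMS is implemented: the class `ToyDRMS`, its method names, and the first-come, first-served policy are all assumptions, and the figures (24 slots, 300 requests) are borrowed from elsewhere in this presentation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    state: str = "queued"  # queued -> running -> done


@dataclass
class ToyDRMS:
    total_slots: int
    jobs: dict = field(default_factory=dict)     # job management: every job submitted
    running: dict = field(default_factory=dict)  # physical resources: jobs holding a slot
    queue: deque = field(default_factory=deque)  # scheduling and queuing: jobs waiting

    def submit(self, job_id: int) -> Job:
        """Register a job and queue it; it starts at once if a slot is free."""
        job = Job(job_id)
        self.jobs[job_id] = job
        self.queue.append(job)
        self._schedule()
        return job

    def finish(self, job_id: int) -> None:
        """Release the slot held by a running job and start the next waiting job."""
        job = self.running.pop(job_id)
        job.state = "done"
        self._schedule()

    def _schedule(self) -> None:
        # First come, first served: fill free slots from the front of the queue.
        while self.queue and len(self.running) < self.total_slots:
            job = self.queue.popleft()
            job.state = "running"
            self.running[job.job_id] = job


drms = ToyDRMS(total_slots=24)
for request_id in range(300):
    drms.submit(request_id)

running_now = len(drms.running)
waiting_now = len(drms.queue)
print(running_now, waiting_now)
```

The point of the sketch is that excess requests wait in the queue instead of all competing for the hardware at once.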
Dr. Yaohang Li and Dinosolve • Dinosolve examines a protein's sequence of amino acids and determines whether the protein can be manipulated by the addition of a disulfide bond • Each computational result improves the prediction accuracy of future results http://hpcr.cs.odu.edu/dinosolve/index.php
Dinosolve Case Study • Bioinformatics [7] • Disulfide bond prediction program • Disulfide bond creation is important to the research community
Dinosolve Users • Drug design • Pharmaceutical companies • Antibody design • To combat viruses • Bio-energy development • Creation of new fuels to replace diminishing fossil fuels • Genetic mapping [5] • Research to cure cancer, HIV, and other diseases
Accuracy of Popular Tools • More users choose Dinosolve because of its higher accuracy. References 13, 14, and 15
What is the problem? • Processing big data sets is computationally expensive. As the volume of queries grows, performance will drop progressively until the system fails. • 300 simultaneous requests will cause the web server to crash
The user interface will be redesigned to be more aesthetically pleasing
Working with Dinosolve • Input a title • Input the protein sequence • Input an e-mail address • Submit, then wait for confirmation... Protein sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
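A short sketch of what checking such a sequence might look like before submission. The presentation does not describe how Dinosolve validates its input, so the function names and rules below are assumptions; the only domain facts used are the 20 standard one-letter amino acid codes and that disulfide bonds form between cysteine residues (code `C`).

```python
# One-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(raw: str) -> str:
    """Remove all whitespace and upper-case the sequence."""
    return "".join(raw.split()).upper()


def invalid_residues(sequence: str) -> list:
    """Return the sorted characters that are not standard amino acid codes."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)


def cysteine_positions(sequence: str) -> list:
    """Return 1-based positions of cysteines, the residues that can form disulfide bonds."""
    return [index for index, residue in enumerate(sequence, start=1) if residue == "C"]


example = normalize_sequence("mkc ilC\nagt")
print(example)                      # MKCILCAGT
print(invalid_residues(example))    # []
print(cysteine_positions(example))  # [3, 6]
```

Rejecting malformed sequences up front would keep obviously invalid requests from occupying a slot on the cluster.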
Working with Dinosolve • The request is confirmed • Now wait for the results
Working with Dinosolve • Check your e-mail • Click the link provided • The results are displayed
Dinosolve Issues • As Dinosolve continues to grow in popularity, the following are expected to occur: • Shortages of hard (physical) resources for computation • CPU cycles • Memory • Disk space • Network bandwidth • Server crashes • The goal is to prepare the system to keep supporting the research community as the number of requests grows
Software • Unix operating system installed on the Dinosolve cluster • Dinosolve algorithm • Sun Grid Engine, installed on the cluster, which will serve as our Distributed Resource Management System (D-RMS) • MySQL (database software) • Web-based user interface (website)
Hardware • MySQL database server • A computer cluster to run the Dinosolve algorithm • Web server for web-based user interface
How will we correct the problem? Configure a distributed resource management system
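Once a D-RMS such as Sun Grid Engine is configured, the web front end would hand each request to the scheduler rather than run it directly. The sketch below only builds the submission command and does not execute it. The flags `-N` (job name), `-cwd` (run in the current directory), `-j y` (merge error output into standard output) and `-o` (output file) are standard `qsub` options; the script name `run_dinosolve.sh`, the job name, and the log path are hypothetical.

```python
import shlex


def build_qsub_command(job_name: str, script_path: str, log_path: str) -> list:
    """Build (but do not run) a qsub command line for one prediction request."""
    return [
        "qsub",
        "-N", job_name,   # name shown in the queue listing
        "-cwd",           # run the job from the current working directory
        "-j", "y",        # merge stderr into stdout
        "-o", log_path,   # file that receives the job output
        script_path,      # hypothetical wrapper script around the Dinosolve algorithm
    ]


command = build_qsub_command("dinosolve_42", "run_dinosolve.sh", "logs/dinosolve_42.log")
print(shlex.join(command))
```

On a machine with Sun Grid Engine installed, the resulting list could be passed to `subprocess.run`; the scheduler then decides when a free execution slot is available.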
Competing Distributed Resource Management Systems • Sun Grid Engine (SGE) • Portable Batch System (PBS) • Load Sharing Facility (LSF)
508 Compliance • Section 508 of the Rehabilitation Act of 1973, as amended in 1998 • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32] • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]
Why is it important to be compliant? If an entity wishes to receive government funding, any electronic form the entity uses must be 508 compliant.
Objectives • Interpret and visualize current usage statistics • Configure, utilize, and optimize the SGE • Deliver an aesthetically pleasing and professional user interface
What benefits will come from attaining the goals? • Efficient utilization of available resources • Increased throughput of the cluster • An intuitive and professional user interface • Rise in popularity due to excellent accuracy, efficiency, and professional design
Conclusion With the updated user interface and correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server.
References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster
References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium.
7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/D-index.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/
References for competition
11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor http://scratch.proteomics.ics.uci.edu/
15. DiANNA server http://clavius.bc.edu/~clotelab/DiANNA/
Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf
Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/
IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF
Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf
Reference for 508 Compliance 26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
Appendix • Competition Matrix for Resource Management Systems • 508.22 Compliance Statistics for Dinosolve
Competing Resource Management Systems Reference 19