Leveraging Database Technologies in Condor

Jeff Naughton
March 14, 2005

Overview
• Introducing ourselves
• How we got involved
• What we are doing and what we hope to do
• Request for input

Who we are
• Faculty: David DeWitt, Jeff Naughton
• Students: Jiansheng Huang, Ameet Kini, Christine Reilly, Eric Robinson, Srinath Shankar, Lakshmikant Shrinivas

Wisconsin DB Group
• A world-leading DB research group for over 20 years.
• Strong presence in:
  • Research publications.
  • Grads on faculty at top schools (Berkeley x 2, Cornell x 2, CMU)
  • Grads at top industrial DB research centers (IBM Almaden, MS Research)
  • Grads in development organizations of main DB companies (IBM DB2, Oracle, MS SQL Server)
• History of influential software artifacts (WiSS, Gamma, Exodus, SHORE, Paradise)

So how did we get to Condor/Paradyn week?
• 4th floor of CS building: 4361 Naughton, 4367 DeWitt, 4369 Livny (adjacent offices!)
• Miron was very persuasive. His algorithm:
  1. Enter our offices.
  2. Describe some challenging and interesting data management problem Condor faces or will face.
  3. Leave office, get on airplane.
  4. Return to Madison, go to 1.

Why Condor and DBMS?
• Premise: A running Condor system is awash in data:
  • Operational data
  • Historical data
  • User data
• DBMS technology can help capture, organize, manage, archive, and query this data.

Three potential levels of involvement
• Passively collect and organize data, expose it through DB query interfaces.
• Move/extend some data-related portions of Condor to DBMS (Condor writes to and reads from DBMS).
• Provide services to help users manage their data.

Why do this?
• For Condor developers:
  • Easier to troubleshoot and debug the system;
  • Easier to implement new functionality;
  • Less time hassling with data management issues;
    • Power of a declarative data management language.
  • Easier to make data management aspects of the system scalable;
    • Leverage 25 years of DBMS research on scalable data management.

Why do this?
• For Condor administrators:
  • Easier to analyze and troubleshoot;
  • Easier to audit;
  • Easier to explore current and past system status and behavior.

Why do this?
• For Condor users:
  • An ever-improving system due to more productive developers and administrators.
  • Easier to monitor and understand performance of their jobs.
  • Easier to analyze history of their use of the system.
  • Complete record of every job they have submitted, and everything that happened to every job while it was running.
  • Support for detailed data lineage queries.
  • Data management facilities to assist them in handling large, complex, inter-related data sets.

Our projects and plans
• Quill: Transparently provides a DBMS query interface to job_queue and history data. [status: ready to deploy!]
• CondorDB: Transparently captures and provides an interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]

Longer-term plans
• Tight integration of DBMS technology and Condor. [status: thinking hard!]
• DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]

Why doesn’t Condor currently use DBMS technology?
• Simple answer: Condor and DBMSs “grew up” together.
  • Condor project started 1986.
  • Postgres project started 1986.
• Now both are ready for each other.

Project 1: Quill
• Non-invasive approach to capturing job-related information
• Works by sniffing updates to the job queue log
• Serves condor_q and condor_history queries
• Independent, reliable, and efficient querying of job-related information

So how does it work?

Quill Architecture
[Diagram: Master, Startd, …, Schedd, and Quill daemons. The Schedd writes events to the job queue log; Quill gets new events from the log and stores them in the RDBMS queue and history tables.]
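The log-sniffing flow in the architecture above can be sketched roughly as follows. This is a minimal illustration, not Quill's implementation: the `SET`/`DONE` event format, the `queue` and `history` table layouts, and the function names are all assumptions made for the example, and an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3


def open_store():
    """Create an in-memory stand-in for the RDBMS (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queue (job_id TEXT, attr TEXT, value TEXT, "
        "PRIMARY KEY (job_id, attr))"
    )
    db.execute("CREATE TABLE history (job_id TEXT, attr TEXT, value TEXT)")
    return db


def apply_new_events(db, log_text, offset):
    """Apply complete log lines found after `offset`; return the new offset.

    Assumed (not real) log format:
      SET <job_id> <attr> <value>   set an attribute of a queued job
      DONE <job_id>                 move the job from queue to history
    """
    new_text = log_text[offset:]
    # Only consume whole lines; a partially written last line waits
    # for the next polling round.
    end = new_text.rfind("\n") + 1
    for line in new_text[:end].splitlines():
        parts = line.split(" ", 3)
        if parts[0] == "SET" and len(parts) == 4:
            _, job_id, attr, value = parts
            db.execute(
                "INSERT OR REPLACE INTO queue VALUES (?, ?, ?)",
                (job_id, attr, value),
            )
        elif parts[0] == "DONE" and len(parts) == 2:
            job_id = parts[1]
            db.execute(
                "INSERT INTO history "
                "SELECT job_id, attr, value FROM queue WHERE job_id = ?",
                (job_id,),
            )
            db.execute("DELETE FROM queue WHERE job_id = ?", (job_id,))
    db.commit()
    return offset + end
```

Because the reader only remembers an offset into the log, the writer of the log never has to know the reader exists, which is the sense in which this style of capture is non-invasive.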

Querying Job-Related Information
[Diagram: Master, …, Startd, Schedd, and Quill daemons. condor_q queries go to an already busy schedd! condor_q++ queries go to the RDBMS instead: independent and more powerful query functionality.]

Quill benefits
• Robustness: Monitored by the master just like other Condor daemons, so it is resilient to failure
• Independence: Not in the critical path of any other Condor daemon
• Performance: Uses SQL to serve job-related queries an order of magnitude faster
• Functionality: A broader range of queries
• Extensibility: Easy to add more complex queries
• Downside: Only handles job queue and history data.
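As an illustration of the "broader range of queries" point, here is a small sketch of an aggregate query over job data held in a relational table. The `jobs` table, its columns, and the sample rows are hypothetical, chosen only for the example; they are not Quill's schema.

```python
import sqlite3


def load_jobs(rows):
    """Load (job_id, owner, status, run_seconds) rows into a hypothetical table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, owner TEXT, "
        "status TEXT, run_seconds INTEGER)"
    )
    db.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", rows)
    return db


def running_time_by_owner(db):
    """Per owner: number of running jobs and their total run time, busiest first."""
    return db.execute(
        "SELECT owner, COUNT(*), SUM(run_seconds) FROM jobs "
        "WHERE status = 'running' "
        "GROUP BY owner ORDER BY SUM(run_seconds) DESC"
    ).fetchall()


# Made-up sample data for the sketch.
SAMPLE = [
    ("1.0", "alice", "running", 120),
    ("1.1", "alice", "idle", 0),
    ("2.0", "bob", "running", 300),
    ("3.0", "alice", "running", 60),
]
```

With the sample rows, `running_time_by_owner(load_jobs(SAMPLE))` groups and ranks owners in a single declarative statement; adding a new kind of summary means writing another query rather than new traversal code.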

Project 2: CondorDB
• CondorDB is a passive approach to capturing operational data in a Condor pool
• Modified daemons log events to the database at run time (no log sniffing)
• Central database serves the entire pool
• Web-based query GUI

Data Capture in CondorDB
[Diagram: the daemons on a machine (Schedd, Shadow, Startd, Starter, Negotiator) record events to the Database.]
• Condor daemons augmented to record important events in a database
• Database is in addition to standard daemon logs
• Pool will run unaffected even in the absence of a database
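One way to read the last two bullets is that database writes are best-effort and sit alongside the normal daemon log. The sketch below illustrates that pattern only; the `EventRecorder` class, the `events` table, and the use of SQLite are assumptions for the example, not CondorDB's actual design.

```python
import logging
import sqlite3


class EventRecorder:
    """Write every event to the daemon log; also try the database, best-effort."""

    def __init__(self, connect, logger):
        self._connect = connect  # callable returning a DB connection (hypothetical)
        self._logger = logger
        self._db = None

    def record(self, daemon, event):
        """Log the event; return True only if it also reached the database."""
        # The standard log is always written, whether or not a database exists.
        self._logger.info("%s: %s", daemon, event)
        try:
            if self._db is None:
                self._db = self._connect()
            self._db.execute(
                "INSERT INTO events (daemon, event) VALUES (?, ?)",
                (daemon, event),
            )
            self._db.commit()
            return True
        except sqlite3.Error:
            # Database missing or failing: drop the connection and carry on,
            # so the daemon's real work is never blocked by the database.
            self._db = None
            return False
```

A caller can ignore the return value entirely; it exists only so that a daemon could count or report events that did not reach the database.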

CondorDB User Interface
• Users can access Condor through a web interface
• Job queue, job history, machine info, match and reject info, aggregates and summaries, etc.
• The web server queries the database with PHP

Users see only their own job information

Users see only their own job queue on a shared machine

Drill-down to get detailed job information

Matchmaking data at your fingertips
[Screenshots: Matches, Rejects]

Machine information in a single central repository

The data-centric approach makes many tasks easier
• Privacy enhanced by presenting user with queue/history information about her jobs only
• Intuitive “drill-down” navigation to get increasingly detailed information
• All information about a job from submit-time until present available from a single screen
• Useful summary information presented in tabular and graphical format
• Optionally query database directly for ad hoc information on job queue, job history, matchmaking and file usage
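The per-user view and the drill-down navigation described above both reduce to owner-scoped queries. The following sketch shows one possible shape, under assumptions: the `jobs` and `job_events` tables, their columns, and the function names are hypothetical and are not taken from CondorDB.

```python
import sqlite3


def make_db(jobs, events):
    """Build a hypothetical two-table store: jobs and their per-job events."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, owner TEXT, status TEXT)")
    db.execute("CREATE TABLE job_events (job_id TEXT, seq INTEGER, description TEXT)")
    db.executemany("INSERT INTO jobs VALUES (?, ?, ?)", jobs)
    db.executemany("INSERT INTO job_events VALUES (?, ?, ?)", events)
    return db


def my_jobs(db, user):
    """Top-level view: only the jobs owned by `user`."""
    return db.execute(
        "SELECT job_id, status FROM jobs WHERE owner = ? ORDER BY job_id",
        (user,),
    ).fetchall()


def job_detail(db, user, job_id):
    """Drill-down: event history of one job, empty unless `user` owns it."""
    rows = db.execute(
        "SELECT e.description FROM job_events e "
        "JOIN jobs j ON j.job_id = e.job_id "
        "WHERE j.owner = ? AND j.job_id = ? ORDER BY e.seq",
        (user, job_id),
    ).fetchall()
    return [row[0] for row in rows]
```

Putting the owner check inside each query, rather than filtering in the page code afterwards, means a user who guesses another job's identifier still gets an empty result.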

Acknowledgement
• The Condor team has been wonderfully responsive and supportive throughout this effort.

Demos!
• Come see demos of Quill and CondorDB in room 4360 CS on Wed. afternoon.

Virtuous Cycle
• As we learn where Condor can use DBMS technology, we also learn where DBMS technology can be (must be?) improved.
  • Support for dynamic-schema sparse data sets.
  • Extreme requirements of self-installation and self-maintenance.
  • Pushing match-making style operations into DBMS.
• Improving DBMS technology will lead to more places that it can be installed.

Request
• We want your input!
• We have a lot of ideas but want to filter, modify, and augment them through the benefit of your experience.
• Send mail to naughton@cs.wisc.edu anytime.
