Umpire is an automated tool for detecting errors in MPI programs, such as deadlock and resource errors. It uses dynamic software testing, a shared-memory implementation, and verification algorithms to make MPI programs safer.
Umpire: Making MPI Programs Safe
Bronis R. de Supinski and Jeffrey S. Vetter
Center for Applied Scientific Computing
August 15, 2000
Umpire
• Writing correct MPI programs is hard
• Unsafe or erroneous MPI programs
  • Deadlock
  • Resource errors
• Umpire
  • Automatically detects MPI programming errors
  • Dynamic software testing
  • Shared memory implementation
Umpire Architecture
[Figure: MPI application tasks 0 through N-1 are interposed using the MPI profiling layer and send transactions via shared memory to the Umpire Manager, which runs the verification algorithms on top of the MPI runtime system.]
Collection System
• Calling task
  • Uses the MPI profiling layer
  • Performs local checks
  • Communicates with the Manager if necessary:
    • Call parameters
    • Return program counter (PC)
    • Call-specific information (e.g., buffer checksum)
• Manager
  • Allocates Unix shared memory
  • Receives transactions from calling tasks
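The transaction a calling task ships to the Manager can be pictured as a small record. The sketch below is illustrative only (the field names and layout are assumptions, not Umpire's actual data structures); it shows why a buffer checksum is useful call-specific information: the Manager can recompute it later to detect errant writes to a send buffer while the operation is still pending.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Transaction:
    """Sketch of a record a calling task might send to the Manager
    over shared memory (illustrative field names)."""
    task: int        # MPI rank of the calling task
    call: str        # e.g. "MPI_Isend"
    return_pc: int   # return program counter of the call site
    checksum: int    # call-specific info, e.g. a send-buffer checksum

def make_transaction(task, call, return_pc, buffer=b""):
    # Checksumming the buffer at call time lets a checker compare
    # against the buffer contents later, flagging modifications made
    # before the non-blocking send completed.
    return Transaction(task, call, return_pc, zlib.crc32(buffer))
```

A checker would recompute `zlib.crc32` over the same buffer when the operation completes and report a mismatch as an errant write.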
Manager
• Detects global programming errors
• Unix shared memory communication
• History queues
  • One per MPI task
  • Chronological lists of MPI operations
• Resource registry
  • Communicators
  • Derived datatypes
  • Required for message matching
• Performs verification algorithms
Configuration-Dependent Deadlock
• Unsafe MPI programming practice
• Code result depends on:
  • MPI implementation limitations
  • User input parameters
• Classic example code:

  Task 0      Task 1
  MPI_Send    MPI_Send
  MPI_Recv    MPI_Recv
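Why this pattern is only conditionally safe can be shown with a toy model (this is not MPI code; the names and the single `eager_limit` knob are simplifying assumptions). An MPI_Send may return immediately if the implementation can buffer the message internally; otherwise it blocks until a matching receive is posted. If both tasks block in their sends, neither ever reaches its receive:

```python
def exchange_deadlocks(message_size, eager_limit):
    """Toy model of the Send/Send-then-Recv/Recv pattern.

    Assume each MPI_Send completes immediately when the message fits
    in the implementation's internal buffering (roughly, its size is
    at most the eager limit), and otherwise blocks for a matching
    receive.  Returns True when both tasks block in MPI_Send and the
    exchange deadlocks.
    """
    send_buffered = message_size <= eager_limit
    # If neither send can be buffered, both tasks are stuck in
    # MPI_Send and no MPI_Recv is ever posted: deadlock.
    return not send_buffered
```

The same source code thus runs to completion for small messages and deadlocks for large ones, depending on the MPI implementation's limits and on user input parameters.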
Mismatched Collective Operations
• Erroneous MPI programming practice
• Simple example code:

  Tasks 0, 1, & 2    Task 3
  MPI_Bcast          MPI_Barrier
  MPI_Barrier        MPI_Bcast

• Possible code results:
  • Deadlock
  • Correct message matching
  • Incorrect message matching
  • Mysterious error messages
Deadlock Detection
• MPI history queues
  • One per task in the Manager
  • Track MPI messaging operations
  • Items added through transactions
  • Removed when safely matched
• Automatically detect deadlocks
  • MPI operations only
  • Wait-for graph
  • Recursive algorithm
  • Invoked when a queue head changes
  • Timeouts also supported
Deadlock Detection Example

  Task 0     Task 1     Task 2     Task 3
  Bcast      Bcast      Bcast      Barrier
  Barrier    Barrier    Barrier

Transactions received by the Manager:
  Task 1: MPI_Bcast
  Task 0: MPI_Bcast
  Task 2: MPI_Bcast
  Task 2: MPI_Barrier
  Task 0: MPI_Barrier
  Task 3: MPI_Barrier
  Task 1: MPI_Barrier → ERROR! Report it!
Resource Tracking Errors
• Many MPI features require resource allocations
  • Communicators, datatypes, and requests
• Detect "leaks" automatically
• Simple "lost request" example:

  MPI_Irecv (..., &req);
  MPI_Irecv (..., &req);
  MPI_Wait (&req, ...);

• Complicated by assignment
• Also detects errant writes to send buffers
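The "lost request" check can be modeled as a registry keyed by the handle variable. This is a toy sketch under simplifying assumptions (in particular, it ignores the aliasing introduced by assignment, which the slide notes complicates the real analysis): overwriting a handle that still names a pending request loses that request forever.

```python
class RequestRegistry:
    """Toy model of request tracking in an MPI checker
    (illustrative; not Umpire's actual data structures)."""

    def __init__(self):
        self.pending = {}   # handle name -> pending request id
        self.next_id = 0
        self.lost = []      # request ids that can no longer be waited on

    def irecv(self, handle_name):
        # Overwriting a handle that still names a pending request
        # "loses" that request: no MPI_Wait can ever complete it.
        if handle_name in self.pending:
            self.lost.append(self.pending[handle_name])
        self.pending[handle_name] = self.next_id
        self.next_id += 1

    def wait(self, handle_name):
        # A matched wait retires the pending request.
        self.pending.pop(handle_name, None)
```

Replaying the slide's example (two MPI_Irecv calls into `req`, then one MPI_Wait) leaves exactly one lost request in the registry.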
Conclusion
• First automated MPI debugging tool
  • Detects deadlocks
  • Eliminates resource leaks
  • Assures correct non-blocking sends
• Performance
  • Low overhead (21% for sPPM)
  • Located a deadlock in code set-up
• Limitations
  • MPI_Waitany and MPI_Cancel
  • Shared memory implementation
  • Prototype only
Future Work
• Further prototype testing
  • Improve the user interface
  • Handle all MPI calls
• Tool distribution
  • LLNL application group testing
  • Exploring mechanisms for wider availability
• Detection of other errors
  • Datatype matching
  • Others?
• Distributed memory implementation
UCRL-VG-139184
Work performed under the auspices of the U.S. Department of Energy by University of California, Lawrence Livermore National Laboratory, under Contract W-7405-Eng-48.