
Designing and Implementing Processing Pipelines with Conductor: The HiROC Experience

Bradford Castalia, Systems Analyst, Planetary Image Research Laboratory, HiRISE Operations Center, University of Arizona, Tucson, Arizona


Presentation Transcript


  1. Designing and Implementing Processing Pipelines with Conductor: The HiROC Experience. Bradford Castalia, Systems Analyst, Planetary Image Research Laboratory, HiRISE Operations Center, University of Arizona, Tucson, Arizona

  2. Pipeline Processing
     • Conductor is a Java application for managing queues of source files to be processed by sequences of procedures.
     • Procedures
       • Defined in a database table by sequence number
       • Data processing procedure
       • Success criteria
         • A procedure must be successful for the next to run
       • On-failure (branch) procedure
     • Sources
       • Defined in a database table by source number
       • Source file pathname
       • Log file pathname
       • Procedure status values
       • Will be processed by one and only one Conductor
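The sequencing rule on this slide (run procedures in sequence-number order, require success before the next one runs, branch to an on-failure procedure otherwise) can be sketched as follows. This is a minimal Python illustration, not Conductor's Java implementation: the `Procedure` class, `process_source` function, the use of an exit status as the only success criterion, and the example pathname are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Procedure:
    """Hypothetical stand-in for one row of a Procedures table."""
    sequence: int                          # position in the pipeline
    command: Callable[[str], int]          # stand-in for the data processing procedure
    success_status: int = 0                # simplified success criterion: expected status
    on_failure: Optional[Callable[[str], int]] = None  # branch procedure, if any


def process_source(source_pathname: str, procedures: List[Procedure]) -> List[int]:
    """Run procedures in sequence order and stop at the first failure.

    Returns the status value recorded for each procedure that ran.
    """
    statuses = []
    for proc in sorted(procedures, key=lambda p: p.sequence):
        status = proc.command(source_pathname)
        statuses.append(status)
        if status != proc.success_status:
            # A failed procedure blocks the rest; run its branch if one is defined.
            if proc.on_failure is not None:
                proc.on_failure(source_pathname)
            break
    return statuses


notified = []


def notify_operators(pathname: str) -> int:
    """Hypothetical on-failure procedure that records which source failed."""
    notified.append(pathname)
    return 0


procedures = [
    Procedure(2, lambda path: 1, on_failure=notify_operators),  # fails
    Procedure(1, lambda path: 0),                               # succeeds
    Procedure(3, lambda path: 0),                               # never reached
]
statuses = process_source("/data/example.img", procedures)
```

Here the second procedure fails, so the third never runs and the on-failure branch is invoked once for the source.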

  3. Pipeline Processing
     • Database
       • Procedures and Sources tables are paired
       • Multiple Conductor instances use the same database
       • Multiple Conductor instances for the same pipeline
     • Configuration
       • Based on ISO standard PVL
       • Configuration files may be shared
       • Configuration files may be included (e.g. site config) in other configuration files
       • Environment variables are included
       • Conductor-maintained parameters
     • Reference Resolving
       • Configuration parameter references
       • Database field references
       • Nested references
       • Expression evaluation
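Nested reference resolving can be illustrated with a small Python sketch. It assumes the `${Name}` reference form that appears in the slides' `${Continue_Status}` example, and it assumes that nested references resolve innermost first; the `resolve` function, the parameter names, and the paths are hypothetical. Database field references and expression evaluation are not covered here.

```python
import re

# Matches a reference that contains no further reference, i.e. an innermost one.
INNERMOST = re.compile(r"\$\{([^${}]+)\}")


def resolve(text: str, parameters: dict, max_passes: int = 20) -> str:
    """Replace ${Name} references with parameter values, innermost first.

    Each pass substitutes every innermost reference; passes repeat until no
    references remain. A reference to an unknown name raises KeyError.
    """
    passes = 0
    while INNERMOST.search(text):
        if passes == max_passes:
            raise ValueError("references did not settle; possible cycle")
        text = INNERMOST.sub(lambda m: str(parameters[m.group(1)]), text)
        passes += 1
    return text


# Hypothetical configuration parameters.
parameters = {
    "Site": "HiROC",
    "HiROC_Root": "/HiRISE/Data",
    "Stage": "EDRgen",
}
resolved = resolve("${${Site}_Root}/${Stage}", parameters)
# First pass:  ${HiROC_Root}/EDRgen
# Second pass: /HiRISE/Data/EDRgen
```

The pass limit is a simple guard against a parameter whose value refers back to itself.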

  4. [Diagram; labels shown: Science Teams and Ops Staff; Public; HiReport; HiEST; HiCat; HiWeb; Pipelines; PDS Products; HiVali DOM Eng; RSDS; HiDOG; EDRgen; RDRgen; HiArch; Conductor; ISIS; HiSPICE; Host OS Environment; Downlink Data Flow]

  5. [Diagram; labels shown: RSDS Raw Data Repository; HiDog Pipeline; EDRgen Pipeline; EDR_Stats Pipeline; HiCal Pipeline; WatchDog: Check data availability; HiStitch Pipeline; Standard Data Products; HiCat Database; Full-Res Color RDR; Full-Res Red RDR; EDR; EDR Table; RDR Table; Geometry Table; HiccdStitch Pipeline; Validation SPICE Pause Validation & Release; RedGeom Pipeline; RedMosaic Pipeline; Internal Products (JPEG2000); RDRgen (JPEG2000) Pipeline; HiGeomInit Pipeline; ColorMosaic Pipeline; ColorGeom Pipeline; NAIF Node SPICE Repository; SPICE; HiSPICE]

  6. Initiate and Data Download
     • FEI_Watchdog
     • Poll the data delivery server (RSDS)
     • Register the download file
     • Pipeline_Source
     • Fetch and prep the data file
     • Download the file from the server
     • Notify operators on failure
     • Only continue if configured to do so
     • perl -e 'exit ${Continue_Status};'
     • Move the file and update the Source_Pathname
     • Register the file in the next pipeline
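The `perl -e 'exit ${Continue_Status};'` step reads as a gate procedure: the configured value becomes the exit status of a trivial command, and the pipeline continues only if that status meets the success criteria. The Python sketch below imitates that idea under two assumptions that the slides do not state outright: that the `${Continue_Status}` reference is substituted before the command runs, and that an exit status of 0 counts as success. The function names are hypothetical, and a Python child process stands in for the perl one-liner.

```python
import subprocess
import sys


def gate_command(continue_status: int) -> list:
    """Build a command that simply exits with the configured status.

    Python stand-in for: perl -e 'exit ${Continue_Status};'
    """
    return [sys.executable, "-c", f"import sys; sys.exit({int(continue_status)})"]


def may_continue(parameters: dict) -> bool:
    """Run the gate command; treat exit status 0 as 'continue' (an assumption)."""
    completed = subprocess.run(gate_command(parameters["Continue_Status"]))
    return completed.returncode == 0


continue_enabled = may_continue({"Continue_Status": 0})
continue_disabled = may_continue({"Continue_Status": 1})
```

Expressing the switch as an ordinary procedure means the decision to continue lives in configuration rather than in pipeline code.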

  7. EDR Production and Metadata Collection
     • Check for multi-channel data file
     • Break out channel files and register new sources
     • Generate EDR product file
     • PVL_to_DB map of PDS label parameters to HiCat EDR_Products record field values
     • Replace existing record if configured to do so
     • RDR and Extras Production
     • Photometric processing
     • Geometric processing
     • Collect all channel files for the observation before registering them in the next pipeline
     • Use multiple systems in parallel for compute-intensive processing
     • Reprocessing
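The rule "collect all channel files for the observation before registering them in the next pipeline" amounts to a completeness barrier. A minimal Python sketch follows; the `ChannelCollector` class, the observation identifier, the channel numbering, and the pathnames are hypothetical, and the actual HiROC pipelines may track completeness differently (for example through database records).

```python
from collections import defaultdict


class ChannelCollector:
    """Hold channel files per observation until every expected channel has arrived."""

    def __init__(self, expected_channels):
        self.expected = set(expected_channels)
        self.arrived = defaultdict(dict)  # observation id -> {channel: pathname}

    def add(self, observation_id, channel, pathname):
        """Record one channel file.

        Returns the pathnames in channel order once the observation is
        complete, otherwise None.
        """
        files = self.arrived[observation_id]
        files[channel] = pathname
        if set(files) >= self.expected:
            complete = [files[c] for c in sorted(self.expected)]
            del self.arrived[observation_id]  # release the observation exactly once
            return complete
        return None


collector = ChannelCollector(expected_channels=[0, 1])
first = collector.add("OBS_0001", 1, "/data/obs1_ch1.img")   # still waiting for channel 0
second = collector.add("OBS_0001", 0, "/data/obs1_ch0.img")  # observation is now complete
```

Only the call that completes the set returns pathnames, which is the point at which the files would be registered as sources in the next pipeline.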

  8. Management Issues
     • Incremental pipeline development
       • The ability to grow the network of pipelines without inherent ripple effects is very important.
       • Splitting and merging pipeline segments can be done at will.
       • Testing of pipeline segments or portions of a network can be done in sandbox environments, including individual developer or user contexts, separate from the production environment. A sandbox does not need a complete production configuration, yet it exactly mirrors the production configuration and operations.
     • Adaptable to the level of demand
       • Conductor instantiations can be added to or removed from pipeline processing at any time.
     • Error tolerant
       • Each Conductor acts independently.
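Adding or removing independent Conductors at any time relies on each source being processed by one and only one Conductor, which in turn needs some form of exclusive claim on rows of the shared Sources table. The slides do not say how Conductor does this. The sketch below shows one common approach, a conditional update, using Python's built-in `sqlite3` module and a hypothetical `sources` table with hypothetical column names.

```python
import sqlite3


def claim_next_source(db, conductor_id):
    """Try to claim one unclaimed source; return its pathname, or None.

    The UPDATE only succeeds while the row is still unclaimed, so two
    workers that select the same row cannot both win it. A caller that
    gets None after losing such a race would simply try again.
    """
    with db:  # commit on success, roll back on error
        row = db.execute(
            "SELECT source_number, source_pathname FROM sources "
            "WHERE conductor_id IS NULL ORDER BY source_number LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        claimed = db.execute(
            "UPDATE sources SET conductor_id = ? "
            "WHERE source_number = ? AND conductor_id IS NULL",
            (conductor_id, row[0]),
        )
        return row[1] if claimed.rowcount == 1 else None


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sources ("
    "source_number INTEGER PRIMARY KEY, source_pathname TEXT, conductor_id TEXT)"
)
db.executemany(
    "INSERT INTO sources (source_pathname) VALUES (?)",
    [("/data/a.img",), ("/data/b.img",)],
)
claim_a = claim_next_source(db, "conductor-A")
claim_b = claim_next_source(db, "conductor-B")
claim_none = claim_next_source(db, "conductor-A")  # queue is empty
```

Because the claim is recorded in the shared database, a worker can be started or stopped without coordinating with the others.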

  9. Hardware System Design Issues
     • Network bandwidth
       • Consider all possible sources
       • Network overload can cause hardware switches to fail
       • Foundation (generally not incremental)
     • CPUs
       • Services: database, web, e-mail
       • Compute engines: add as needed
     • Data storage
       • Fast, local space; especially /tmp
       • Bulk, shared space: add as needed
       • NFS latencies

  10. Future Development
      • PostgreSQL
      • New Data_Port being integrated for distribution
      • Composer
        • Interactive Procedures table definition
        • Add, remove and reorder procedures
        • Edit procedure definition fields
        • Test reference resolving
      • Maestro
        • Manage multiple Conductors
        • Local or remote
        • Start, suspend/resume, stop
        • Monitor logging streams
        • Report throughput and backlogs of Sources
        • Accumulate resource utilization metrics
