Enhancing KFPA Data Analysis Pipeline with Python Tools

Bob Garwood- NRAO-CV KFPA Data Pipeline

History • Science and Data Pipeline Workshop – November 2007. Initial pipeline sketch. • Conceptual Design Review – February 2008. Initial design. • KFPA Data Analysis Meeting – June 2008. • Memo describing possible KFPA observing modes. Pisano, August 2008.

Changes since Conceptual Design Review • Basic design essentially unchanged • Out-of-scope items (deferred) • continuum • cross-correlation (polarization) • complicated calibration schemes (“basketweaving”) • baseline fitting added as an explicit step

Existing GBT Data Analysis Software • sdfits tool produces SDFITS file – associates raw data from a backend with meta data describing the observations. DCR, SP, Spectrometer. (data capture) • GBTIDL – recommended spectral line analysis tool. Focused on single spectra processing and analysis, not on imaging. Used to prepare the data to be imaged elsewhere. (calibration, editing) • AIPS is used to produce images.

We can reduce k-band data now • K-band spectrometer data calibrated and imaged using existing tools.

Missing components • None of the steps to an image are automated. • Uses lab-measured Tcal values. • Uses a scalar Tcal without regard to any structure in Tcal across the bandpass. • Cross-correlation (polarization) data is not supported after the sdfits step. • Poor support for continuum data.

Missing Components continued • Only have prototype tool for visually interacting with large amounts of data (e.g. visual flagging). • Only prototype tools for statistically flagging or editing the data (e.g. RFI rejection).

Goals of the Prototype Pipeline • Support KFPA commissioning • Explore new processing tools/techniques not yet widely available in GB (vector calibration, statistical data flagging and editing, visualization, parallel processing). • Prototype an automated pipeline – add necessary meta data to capture user intent • Prototype tools necessary to support larger focal plane array (e.g. parallel computing)

Goals continued • Based on prototyped tools, estimate cost associated with delivering a pipeline and necessary computing hardware to handle the expected data rates for a larger focal plane array. • Develop these tools and pipeline infrastructure for use with data from other backends.

Pipelines • Crude pipeline can be assembled from existing components for quick-look images. • Small modification to sdfits (data capture) to properly capture individual feed offsets from pointing position. • Some additional meta data to capture default image parameters and associated “off” information.

Pipelines • Imperative for large focal plane array. • large data rates and volume • Necessary for even a modest 7 element array. • Useful for data from other GBT backends • Users often end up creating partial pipelines • The NRAO archive needs this to be able to provide more than just the raw GBT data. • Other telescopes routinely provide roughly-calibrated data to their users – most institutions consider this the starting point of a data pipeline.

Pipelines • Requires using a standard observing mode. • Sufficient meta data needs to be captured to drive the pipeline (e.g. groups of scans that should be processed together, associated “off” information, etc). • Individual components can be used outside of the pipeline – often with additional options.

Pipeline • None of those steps is unique to the KFPA • KFPA-specific steps are likely as part of the statistical flagging and editing component as well as in data capture. • Components are being developed independently. • no dependencies between components • Some components are likely to be useful interactively – especially flagging and editing.

Pipeline Design continued • Eventually - Continuum data will be extract from the spectral line data at the appropriate point in the pipeline. This work is out-of-scope for the initial pipeline. • Language – python • Experience with python in Green Bank • Same language used in the ALMA pipeline and in casa.

Pipeline design, continued • Data formats • SDFITS up to imaging step. • Currently produced by data capture (sdfits) • Tools already exist to interact with this data. • May be necessary to split data into multiple SDFITS files for parallel computing needs. • Alternatives used as necessary – for speed or take advantage of existing tools – e.g. AIPS

Parallel Computing • Most of these steps are “embarrassingly parallel” - data from individual feeds can be processed independently • exceptions: some statistical flagging and editing and cross-correlation data – these are out of scope for the initial pipeline. • Parallel processing will be explored during KFPA pipeline development.

Development Priorities • Calibration • Complete GBTIDL vector Tcal and initial calibration database work. • Design pipeline calibration database. • Data Capture • This is the current bottleneck. Work is underway to improve the processing speed. A new raw data format may be necessary.

Priorities continued • Data capture (continued) • ensure that feed offsets are used properly with pointing direction to get individual feed pointings • put default calibration values into calibration database (GBTIDL model first, pipeline model when design completed). • Add appropriate meta information as necessary to automate data flow through the pipeline.

Priorities, continued • Pipeline design and implementation • Automate flow of data between existing compontents. • Initially this will be a simple script triggered off of the standard observing modes using default values and available meta information. • It will be possible to re-run the pipeline using alternative parameters (e.g. baseline fits, additional statistical flags, interactive flagging and editing, etc).

Priorities, continued • Data Visualization • Evaluate existing tools for viewing with and interacting with GBT data in sdfits form. • Data quality throughout the pipeline • Interactive flagging • Summer student project – 2008 – prototype data viewer. Can do interactive flagging, not sufficiently general.

Priorities, continued • Investigate simple parallel processing options • start with existing code (sdfits) • take advantage of independence of data from each feed • keep things simple

Priorities, continued • Statistical data flagging • Borrow from code developed by GBTIDL users • Borrow from aips++/casa autoflagger • Develop “basketweaving” equivalent for KFPA array. • Use (near) crossing points on sky (same feed; multiple feeds) to adjust data. • out of scope for initial pipeline development

Priorities, continued • Algorithm development (calibration, continuum data handling, etc). Roberto Ricci, U. Calgary.

Resources • Bob Garwood, NRAO – 1 FTE, component design and development • Robert Ricci, U. Calgary – algorithm development

Enhancing KFPA Data Analysis Pipeline with Python Tools

Enhancing KFPA Data Analysis Pipeline with Python Tools

Presentation Transcript

HMI Data Analysis Pipeline

Data Pipeline: Finance

Managing the NextGen data pipeline

Data Pipeline

Data-pipeline using ALSPAC data

Data Pipeline Project

Pipeline with Data Forwarding

GCOD Data Pipeline

IRAC Pipeline Data Analysis and Pipeline Validation Plan

BIRN Vision for Data Pipeline

Data Pipeline to Data Use

KFPA Monitor and Control Hardware

KFPA Data Pipeline

KFPA Noise Calibration Module and Coupler

A pipeline for fingerprinting data analysis

Data Pipeline Regional Training 2013

Data Pipeline to Data Use

Data Pipeline Project

Pipeline SIG PODS Pipeline Open Data Standard