DMTN-321

Process and tooling recommendations for future Data Releases drawn from the experience of Data Preview 1

Abstract

Data Preview 1 (DP1) was the first release of data acquired, processed and published by Rubin. DP1 achieved its goals and was released on time and to great success; however, we have identified a number of concerns that should be resolved before larger data releases are attempted.

This document is a proposal for process and tooling improvements to cover the lifecycle from the initial completion of Pipelines processing through to Rubin Science Platform publication of the data release.

Context

Data Preview 1 (DP1) was the successful release of Rubin’s first end-to-end data products. While it met its goals (and a number of its stretch goals), the process of getting there was error-prone, planned contingency was exhausted, and a number of Data Management staff had to go into last-minute hero-mode to pull it off.

Important

This document can be considered a management-level lessons-learned/post-mortem for the purposes of creating a sound basis for future data releases. No criticism is attached to Data Management staff whose expertise, high standards and extra-mile work were key to DP1’s success.

Our aim is not just to achieve results but to achieve them with a process that is efficient in staff time, so as to preserve as much development time as possible. Sustained development is central to keeping our software and services state of the art. Delays, miscommunication, errors, human time spent on tasks that can be automated, and heroism to increase the number of hours available in a week are all signs of a process in need of improvement.

We regard hero-mode as the equivalent of a near-miss in aviation: the bad outcome did not occur, and barely anybody pays attention to it, but it’s a warning sign that should not be ignored.

In some cases, gaps in our process could be mitigated only because of DP1’s very small volume (a few percent of DP0.2), which allowed for numerous do-overs. This will not be the case for any future release. Complacency on the basis of DP1’s apparent success is dangerous and we ignore its lessons at our peril.

Key Project-Level Recommendations

The key project-level recommendations in this document are:

  1. DP1, as the first release containing real data and enabling astronomical discoveries, has already in a matter of days given us roadmap-altering insights into user query patterns that never emerged from DP0.2. It is essential to have at least one further release under the “Preview” (shared-risk) banner with science-driven user query patterns at a larger scale (minimum 300 sq deg), in order to have confidence we have properly rehearsed DR1 and have iterated on the data-release process and tooling improvements set out in detail in this document.

  2. We must allow adequate time between DP2 and DR1 to make the data release process as turn-key as possible and to put us on a sound footing for a sustainable annual data release schedule. While community enthusiasm for DR1 is obviously sky-high, failure to adequately prepare for DR1 risks creating a snowball effect that puts all subsequent releases at risk. Data Services is thus strongly in favor of the proposed plan for an end-of-year-1 DR1 (as opposed to a 6-month DR1).

  3. No later than the month following DP2, the RSP hybrid model should be technically reviewed to determine whether it is sound in its present form and if not, any related risks should be exercised so that mitigations can be in place in time for DR1. This item also reinforces (1) and (2).

  4. Within the scope of this document, the Prompt Products release is entirely unrehearsed and requires significant attention from Data Management to get right. Given the obvious high interest in the community for timely transient science, this release should be given priority equal to, if not higher than, that of DP2. This item also reinforces (2).

  5. Agency or other sign-off should be sought to make even a ComCam-sized subset of the data “ok public” (if not actually public) to bypass restrictions that prevent us from shipping data for CI purposes to our cloud-hosted services.

For recommendations at the Data Management group level and the reasoning behind the key recommendations, see the individual sections.

Problem summary

The overall symptoms of the problems identified were:

  • The work between data production of DP1 and live release on the RSP took 4 months of elapsed time, instead of the anticipated 3 months.

  • The three-month plan included 1 month of contingency, which was fully exhausted.

The release was on time regardless, due to heroic efforts, a minor fortuitous delay in the First Light release, deferring until after go-live some issues unlikely to affect users immediately, deferring non-DP1-related tasks, do-overs, and fixing bugs in the wild (mostly found by staff testing) that could have been caught ahead of time had the use of contingency not eaten into the testing time.

A number of these interventions would have been ineffective for a DR1-scale event.

The eventual goal should be a 3-month worst-case release consisting of 2 months of release work including contingency, and 1 month of testing/bugfixing prior to release. We are unlikely to reach this goal in the next two releases.
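The schedule arithmetic above can be made explicit with a short sketch. This is illustrative only: the class and function names are hypothetical, and it assumes the reading that the anticipated 3 months already included the 1 month of contingency, leaving a 2-month baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleasePlan:
    """Planned elapsed time, in months, from end of processing to RSP release."""

    total_months: float        # planned duration, contingency included
    contingency_months: float  # portion of the total held as contingency

    @property
    def baseline_months(self) -> float:
        """Planned duration if no contingency is consumed."""
        return self.total_months - self.contingency_months


def contingency_used(plan: ReleasePlan, actual_months: float) -> float:
    """Months of contingency consumed, capped at what the plan held."""
    excess = actual_months - plan.baseline_months
    return min(max(excess, 0.0), plan.contingency_months)


def overrun(plan: ReleasePlan, actual_months: float) -> float:
    """Months by which the actual duration exceeded the whole plan."""
    return max(actual_months - plan.total_months, 0.0)


# Assumed reading of the DP1 figures quoted above.
dp1_plan = ReleasePlan(total_months=3, contingency_months=1)
dp1_actual = 4

dp1_contingency_used = contingency_used(dp1_plan, dp1_actual)
dp1_overrun = overrun(dp1_plan, dp1_actual)

# Stated eventual goal: 2 months of release work (contingency included)
# plus 1 month of testing and bugfixing.
target_total = 2 + 1
```

Under this reading, DP1 consumed its full month of contingency and then ran one further month over plan, so the elapsed time was twice the 2-month baseline. The 3-month target therefore asks for one month less than DP1 took, at a much larger data volume.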

Appendix: Technical-level recommendations

The body of this document focuses on management shortfalls and organisational issues across departments.

This appendix contains recommendations that can be addressed entirely within Data Management.

Note

This section is a work in progress; the first draft has been left below.