SDP Consortium Progress since last eNews submission
M20 - Pre-CDR Preparation
Since the last eNews submission the SDP has successfully submitted and certified its latest milestone, M19, and is now preparing for the next milestone, M20, Pre-CDR. The M20 Release Readiness Notice (RRN) was submitted in early March with a submission pack of 19 documents, including Systems Engineering documents, interface control documents (ICDs) and an updated snapshot of the SDP architecture views as they have evolved since the M19 submission.
This pre-CDR milestone is a stepping stone to a successful CDR delivery in October. An incremental submission allows key stakeholders to give valuable feedback, in particular on the architectural design, ensuring continuous improvement and alignment of architectural priorities.
In addition to finalising the deliverables for the Pre-CDR submission, the Consortium is focusing on several other areas: further iterations of the SDP software architecture, advances in the SEI views of the high-level SDP architecture, continued refinement of the SDP interfaces, progress on the SDP functional model using the Algorithmic Reference Library (ARL), progress on the SDP prototyping testbed (P3) in Cambridge, progress on the SDP Integration Prototype (SIP) including end-to-end testing, consolidation of the data models work, and monitoring of hardware cost evolution with updates to the corresponding predicted cost, to name a few!
The sections below describe in further detail the recent work and progress in platform and systems integration prototyping (P3, SIP) and in the ARL. This work reduces risk and provides the rationale for design choices ahead of the system CDR.
SIP – SDP Integration Prototype
Since the last eNews submission, the SDP Integration Prototype (SIP) team has been focusing on producing prototype implementations of the SDP execution control components, and on continuing development of a number of processing pipelines which will eventually be used to test an end-to-end deployment of the system on the P3-AlaSKA prototype hardware platform later this year.
Execution Control components are being developed as a set of containerised microservices. Since January we have produced new or updated prototype implementations of the Master Controller, Processing Controller TM Interface, Processing Controller Scheduler, Configuration Database, and Processing Block Controller services. These are the key components responsible for controlling the SDP's execution of Processing Blocks, the unit of scheduling for data processing within the SDP system.
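As a rough illustration of the concepts involved (the class and field names below are illustrative, not the actual SIP code), a Processing Block and its controller might be sketched as:

```python
from dataclasses import dataclass, field
from enum import Enum


class PBState(Enum):
    """Lifecycle states of a Processing Block (illustrative sketch)."""
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class ProcessingBlock:
    """Unit of scheduling for SDP data processing (sketch only)."""
    pb_id: str                  # e.g. "PB-001"
    workflow: str               # e.g. "ical", "vis_receive"
    parameters: dict = field(default_factory=dict)
    state: PBState = PBState.QUEUED


class ProcessingBlockController:
    """Toy controller: drives a single block through its lifecycle.
    In the prototype each such controller runs as a containerised
    microservice coordinating via the Configuration Database."""

    def __init__(self, block: ProcessingBlock):
        self.block = block

    def run(self) -> PBState:
        self.block.state = PBState.RUNNING
        # ... the workflow's processing pipeline would be launched here ...
        self.block.state = PBState.FINISHED
        return self.block.state


pb = ProcessingBlock(pb_id="PB-001", workflow="ical")
ProcessingBlockController(pb).run()
print(pb.state)  # PBState.FINISHED
```

The sketch captures only the scheduling-unit idea; in the real prototype the services communicate through the Configuration Database rather than by direct method calls.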
We have also continued to develop a set of representative processing pipelines, namely a visibility receive and real-time ingest pipeline, an imaging and calibration ‘ICAL’ pipeline, and the ‘MAPS’ pipeline, which is a polarised counterpart to the LOFAR MSSS Survey. These pipelines make extensive use of the SDP Algorithm Reference Library (ARL) and the Dask parallel computing library, and are being developed and tested on the SDP P3-AlaSKA cluster.
For the SIP MAPS pipeline, we have now completed a serial version that uses ARL and implements new cutting-edge algorithms for processing radio polarisation observations. This serial code is available on GitHub. Parallelisation of the code is underway using Dask, with expected completion in the next software sprint.
In order to run this pipeline against a full SIP service stack, we have also begun prototyping the use of queues to ingest and process Scheduling Blocks using ARL and to aggregate QA information. This work is ongoing using Confluent Kafka on P3-AlaSKA, with an initial version of the code expected early in the next sprint. We have also been investigating deployment of the SIP MAPS pipeline within Docker containers, and intend to extend this to the deployment of Kafka brokers with Docker.
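The queue-based pattern can be sketched with Python's standard-library queues standing in for the Kafka topics (a simplification of the prototype; the block identifiers and QA fields here are illustrative): Scheduling Blocks arrive on an ingest queue, a worker processes each one, and QA summaries are pushed to a second queue for aggregation.

```python
import queue
import threading

# Two queues playing the role of two Kafka topics in the prototype:
# one for incoming Scheduling Blocks, one for aggregated QA records.
ingest_q: queue.Queue = queue.Queue()
qa_q: queue.Queue = queue.Queue()


def process_scheduling_blocks():
    """Worker: consume Scheduling Blocks, emit a QA record for each."""
    while True:
        sb = ingest_q.get()
        if sb is None:        # sentinel: no more blocks to process
            break
        # ... the real pipeline (e.g. ARL imaging) would run here ...
        qa_q.put({"sb_id": sb["id"], "status": "ok"})


worker = threading.Thread(target=process_scheduling_blocks)
worker.start()
for i in range(3):
    ingest_q.put({"id": f"SB-{i:03d}"})
ingest_q.put(None)            # signal end of stream
worker.join()

# Aggregate the QA information produced by the worker.
qa_report = [qa_q.get() for _ in range(qa_q.qsize())]
print(qa_report)
```

Replacing the in-process queues with Kafka producers and consumers keeps the same dataflow while allowing the ingest and QA stages to run in separate containers.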
Figure 1: A single channel image of a bright radio pulsar as seen in continuum images generated by the SIP MAPS pipeline. The data are from the polarised counterpart to the LOFAR MSSS survey. The full dataset comprises 50TB of uv-data.
P3 - AlaSKA
The Performance Prototype Platform (P3) and its AlaSKA software-defined environment are now being used extensively to support several prototyping activities: the SIP work described above, as well as activities in support of the Buffer function, SDP and Platform Services, and Execution Frameworks.
Of particular note since last reported, P3-AlaSKA now provides a number of improvements to the storage infrastructure: 10GbE interfaces to the SoftIron Ceph appliance, which provides a persistent store for home directories as well as a temporary large storage pool; GlusterFS and BeeGFS Ansible playbooks for provisioning disaggregated storage nodes based on NVMe and spinning-disk servers (Cold and Hot Buffer); and playbooks for converged BeeGFS configurations using local compute-node SATA SSDs. Ongoing work by the NZA and P3-AlaSKA teams is investigating Object Storage. All storage systems are now available via Mellanox EDR, with access either via RDMA or IPoIB.
In terms of networking, the 25GbE network (the Bulk Data, or High Throughput, Ethernet Network) is now in operation following extensive work with Mellanox and the procurement of passive adaptors. This network can be configured for either RDMA or standard TCP/UDP protocols. Work planned for Sprint 2018B with the SIP team will explore SPEAD packet performance across multiple racks as part of Receive and Real-time Processing, together with writing to the Cold Buffer.
In terms of Execution Frameworks, the P3-AlaSKA team has worked with the Ohio State HiBD team to resolve issues around the support of RDMA-enabled Spark. This is now available as an OpenStack Sahara service.
Work has also progressed on improving the monitoring and logging of the infrastructure and applications using the OpenStack Monasca service. This is one of the subjects covered in the latest SDP Webcast (see link), and it gives users insight into performance, with the ability to define user-centric metrics. The Grafana dashboards shown below provide examples of Monasca output.
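A user-centric metric is submitted to Monasca as a small JSON document posted to its metrics endpoint. The sketch below builds such a payload (the field layout follows the Monasca API's metric schema as we understand it; the metric name and dimension values are hypothetical, and no actual HTTP call is made here):

```python
import json
import time


def make_metric(name, value, dimensions):
    """Build a user-defined metric in the shape accepted by Monasca's
    POST /v2.0/metrics endpoint (schema summarised from the Monasca
    API documentation; names below are illustrative)."""
    return {
        "name": name,
        "dimensions": dimensions,                # key/value tags for filtering
        "timestamp": int(time.time() * 1000),    # milliseconds since epoch
        "value": float(value),
    }


metric = make_metric(
    "sdp.buffer.write_rate_mbs",                 # hypothetical metric name
    812.5,
    {"hostname": "p3-alaska-node01", "service": "cold-buffer"},
)
print(json.dumps(metric, indent=2))
```

Metrics defined this way can then be charted in Grafana alongside the built-in infrastructure metrics, as in the dashboards shown below.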
Figure 2: FIO output for a 2-node cluster, showing CPU, memory, swap and network usage with client-side caching.
Figure 3: Output for a Spark cluster with a master and two slaves, showing CPU, memory and RDMA network consumption.
ARL - Algorithmic Reference Library
The SDP has been working on the implementation of the Algorithmic Reference Library (ARL), designed to present imaging algorithms in a concise Python-based form. The primary aim of such a reference library is to be easily understandable to people not familiar with radio interferometry imaging. In addition, the ARL provides a testbed for experimentation and publication, and a conduit for algorithms into the SKA. Further details can be found at the following website: http://www.mrao.cam.ac.uk/projects/jenkins/algorithm-reference-library/docs/build/html/ARL_background.html
The categories of algorithms covered by the ARL are simulation, calibration, visibility plane, visibility plane to/from image plane, and image plane. A list of the currently available functions is at the following link: http://www.mrao.cam.ac.uk/projects/jenkins/algorithm-reference-library/docs/build/html/ARL_definition.html#arl-api. A summary of the core data models and the related functions is shown below:
Figure 4: Core data model and associated functions.
ARL has been used as a platform for exploring graph processing and its application to calibration and imaging. Graph processing is of interest to SDP as a possible execution framework. To construct and execute graphs built from ARL functions, the Python library Dask (http://dask.pydata.org) has been used. Functions to be executed via a graph are wrapped using the dask.delayed function and then executed using one of a number of schedulers: threaded, multi-processing, or distributed. The figure below shows a continuum imaging pipeline graph in execution. Our tests so far have explored scaling from a laptop with 8 workers up to a cluster with 256 workers, running graphs with up to 100,000 nodes.
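Under the hood, Dask represents work as a plain dictionary mapping keys to tasks, where a task is a tuple of a function and its arguments; wrapping ARL functions with dask.delayed builds such a graph automatically. A minimal illustration of the idea, with toy functions standing in for ARL calls (the function names here are illustrative, not real ARL API):

```python
def get(graph, key):
    """Tiny sequential scheduler for a Dask-style task graph: a task is
    a tuple (function, *args); anything else is a literal value.
    Argument keys are resolved recursively before the call is made.
    (Dask's own threaded/distributed schedulers do this in parallel.)"""
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *args = node
        resolved = [get(graph, a) if a in graph else a for a in args]
        return func(*resolved)
    return node


# Toy stand-ins for ARL imaging operations (names are illustrative only).
def invert(vis):
    return f"invert({vis})"


def deconvolve(dirty):
    return f"deconvolve({dirty})"


# A three-node graph: visibility data -> dirty image -> clean image.
graph = {
    "vis": "visibility_data",
    "dirty": (invert, "vis"),
    "clean": (deconvolve, "dirty"),
}
print(get(graph, "clean"))  # deconvolve(invert(visibility_data))
```

Because the graph is just data, the same pipeline definition can be handed to Dask's threaded, multi-processing, or distributed scheduler without changing the ARL functions themselves.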
While Dask is a good foundation from which to commence this exploration, the Consortium is looking at ways to expand testing to include other execution frameworks that may offer better performance at the scale required by the SDP.
Recent work in the ARL has addressed calibration algorithms: SageCAL and full bandpass calibration algorithms have been added to the library, and facet calibration and peeling are scheduled for the next sprint. Other work includes adding a Quality Assessment pipeline, which will use either Python queues (for the prototype) or, eventually, Kafka. There are also plans to include Non-Imaging Pipeline (NIP) functionality in the ARL; planning for this work will commence in the next sprint.
Figure 5: ARL executing a continuum imaging graph using Dask on a 4 core laptop running 8 workers. The different functions being executed are colour coded.