
SDP eNews - November 2016

Summary

The SDP proceeds apace, implementing the management changes we outlined in our last SKA eNews contribution.

We planned our work for the October-November period to support the ongoing SKA engineering activities, to continue our work on the SDP Execution Framework, and to progress the key algorithms that will perform the processing within the SDP. We've also been conducting engineering workshops to make sure that we fully understand what makes designing the SDP such a challenge.

Stellenbosch meeting

In October we held a large meeting in Stellenbosch, South Africa, coinciding with the SKA Engineering meeting, to plan our work and introduce our new management structure.

As we mentioned previously, we have rewritten the SDP Work Breakdown Structure and changed how we tackle tasks: we now work in short ‘sprints’, focusing on a few specific pieces of work to be completed within each sprint, guided by project risk-reduction priorities.

Over three days we discussed our progress in the August-September sprint and what we wanted to achieve in the October-November sprint. Based on this, we set the objectives for the new sprint and allocated work across the whole consortium.

One of the biggest benefits of meeting in Stellenbosch was being able to meet other members of SDP and connect with them face-to-face, improving our ongoing collaboration.

DALiuGE

In recent months, members of the team at ICRAR have begun detailed testing of their Data Activated Flow Graph Engine (DALiuGE), a baseline implementation of the Execution Framework. The Execution Framework currently represents a significant risk in the SDP design, so the Consortium is also assessing other options for implementing it. We are examining whether we can reuse the graph-based engine from MeerKAT, as well as Commercial Off-The-Shelf (COTS) implementations such as Spark, with the aim of understanding how they might fit the SDP problem or how they would need to be adapted to do so.

The DALiuGE prototype includes a user interface for expressing complex data reduction pipelines, along with a deployment, control and monitoring environment to optimise the execution of such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, such as staff astronomers, operators, algorithm developers and experts in parallel computing. This separation allows them to collectively optimise the large-scale SKA data processing in a coherent manner.

In order to ensure scalability to tens of millions of tasks running on thousands or tens of thousands of individual compute nodes, the execution in DALiuGE is ‘data-activated’, where each individual data item autonomously triggers the processing on itself.
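As a hedged illustration of this ‘data-activated’ idea, the sketch below shows how each data item can notify its consumers when it becomes complete, so that a task fires as soon as all of its inputs are ready. The classes and method names are purely illustrative and are not the real DALiuGE API.

```python
# Conceptual sketch of data-activated execution (illustrative only;
# class and method names are hypothetical, not the real DALiuGE API).

class DataDrop:
    """A data item that notifies its consumers when it becomes complete."""
    def __init__(self, name):
        self.name = name
        self.consumers = []
        self.data = None

    def add_consumer(self, app):
        self.consumers.append(app)
        app.inputs.append(self)

    def write(self, data):
        # Completing the data item autonomously triggers its consumers.
        self.data = data
        for app in self.consumers:
            app.input_completed()


class AppDrop:
    """A task that runs as soon as all of its input drops are complete."""
    def __init__(self, name, func, output):
        self.name = name
        self.func = func
        self.output = output
        self.inputs = []
        self._completed = 0

    def input_completed(self):
        self._completed += 1
        if self._completed == len(self.inputs):
            result = self.func(*(d.data for d in self.inputs))
            self.output.write(result)   # may in turn trigger further tasks


# Tiny two-stage pipeline: no central scheduler decides when to run;
# writing the raw data is what sets the graph in motion.
raw, calibrated, image = DataDrop("raw"), DataDrop("cal"), DataDrop("img")
calib = AppDrop("calibrate", lambda v: [x * 0.5 for x in v], calibrated)
grid = AppDrop("grid", lambda v: sum(v), image)
raw.add_consumer(calib)
calibrated.add_consumer(grid)

raw.write([1.0, 2.0, 3.0])
print(image.data)   # 3.0
```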

Such decentralisation also makes the execution framework very robust and flexible, supporting pipelines ranging from fewer than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world, Tianhe-2.

Framework overhead

Figure 1 (Framework overhead): Measured pure framework overheads for small to medium scale deployments.

Red and yellow bars show the overhead per task (drop) in microseconds in the case where only one instance of deployment and control is used (1 island). An island allows us to group together data that are used by multiple processes, which lets us scale out our processing of the vast amount of data generated by the SKA.

The red bars are for a run on 30 compute nodes, the yellow ones for 60 nodes.

Note that the overhead per drop (task) declines as more drops are deployed (x-axis).

This is because the overhead is completely dominated by the constant time needed to set up and launch the framework itself.

The last two bars (blue and green) show the effect of running 1,054,086 tasks on 150 compute nodes, using one (blue) or five (green) islands, respectively.
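The declining per-drop overhead follows from a simple cost model: a roughly constant cost to set up and launch the framework, amortised over an increasing number of drops, plus a small marginal cost per drop. The short sketch below illustrates this with made-up numbers, not the measured DALiuGE values.

```python
# Illustrative only: a hypothetical cost model, not the measured DALiuGE figures.
setup_cost_us = 5_000_000     # constant cost to set up and launch the framework (5 s)
per_drop_cost_us = 2          # marginal framework cost per drop

for n_drops in (10_000, 100_000, 1_000_000):
    # The constant setup cost is spread over all drops, so the per-drop
    # overhead shrinks towards the small marginal cost as n_drops grows.
    overhead_per_drop = setup_cost_us / n_drops + per_drop_cost_us
    print(f"{n_drops:>9} drops -> {overhead_per_drop:7.1f} us/drop")

# Prints roughly 502.0, 52.0 and 7.0 us/drop respectively.
```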

Large scale overhead

Figure 2 (Large scale overhead): Since the number of tasks in the previous figure clearly did not saturate the scalability of even a single island, we also ran much larger deployments.

This figure shows deployments of roughly 2 to 12 million tasks on one (red) and five (yellow) islands, respectively.

The advantage of using multiple islands is now clearly visible; indeed, a single island appears to reach saturation somewhere between 4 and 8 million tasks.

It should be noted that the absolute overhead per task of less than 10 microseconds is about an order of magnitude lower than what is required.

This series of tests was executed on 400 compute nodes at the Pawsey Supercomputing Centre in Perth.

Regional Centres

The DELIV team within SDP has considered various options for how data will be made available to end users, but until this year there has not been a project-wide view on how this should be done. We are now working with the SKA Regional Centres (SRCs) Coordination Group (SRCCG) to better understand the data distribution needs of the SKA and to finalise what the science processing sites need to provide.

To this end, further consideration is being given to the Science Data Products that SDP will produce (together with the associated metadata needed to enforce SKA access policies) and to the software stack that will need to be deployed at the SRCs.

This must include tools for optimising the transfer of the Science Data Products from the science processing centres to the SRCs and between SRCs.

International Virtual Observatory Alliance (IVOA) standards are being used to guide the tools that will enable a range of activities, including querying the science and product catalogues distributed to the SRCs. The aim is to maximise the efficiency of use of international Wide Area Networks and to provide low-latency access to services globally. A team led by the SKAO is carrying out the requirements analysis for the visualisation and analysis tools that will be needed at the SRCs.
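As a hedged illustration of the kind of standards-based access this implies, the sketch below queries a hypothetical product catalogue through the IVOA Table Access Protocol (TAP) using the pyvo library. The service URL is a placeholder, not a real SRC endpoint, and the table shown is the standard IVOA ObsCore table rather than a confirmed SKA catalogue layout.

```python
# Sketch of querying a science-product catalogue at an SRC via the IVOA
# Table Access Protocol (TAP). The service URL is a placeholder for
# illustration, not an actual SKA Regional Centre endpoint.
import pyvo

service = pyvo.dal.TAPService("https://example-src.org/tap")  # hypothetical SRC TAP service

# ADQL query against the standard IVOA ObsCore table, selecting image products.
results = service.search("""
    SELECT TOP 10 obs_id, target_name, s_ra, s_dec, access_url
    FROM ivoa.obscore
    WHERE dataproduct_type = 'image'
""")

for row in results:
    print(row["obs_id"], row["access_url"])
```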

OpenStack

OpenStack is an open source platform which connects and coordinates compute, storage and networking systems to create a coherent environment. This software-defined model provides the capability to offer cloud services to a range of workloads and applications within a typically virtualised framework.

As an orchestrator of such workloads, OpenStack offers a unique framework for the SKA and the SDP. On the one hand, as a federated cloud environment, OpenStack gives the Regional Centres an opportunity to adopt common software approaches to infrastructure management while supporting a diversity of applications; on the other, it offers a unique opportunity for orchestrating the SDP execution environment and the complex interaction and integration of several sub-systems, from data ingest to preservation, coupled with real-time control.
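As a minimal sketch of what such software-defined orchestration looks like in practice, an OpenStack cloud can be driven programmatically through the openstacksdk library. The cloud name, image, flavour and network below are placeholders, not an actual SDP or SRC configuration.

```python
# Minimal sketch of driving an OpenStack cloud with openstacksdk.
# The cloud name, image, flavour and network are placeholders, not an
# actual SDP or SRC deployment.
import openstack

conn = openstack.connect(cloud="sdp-dev")   # hypothetical entry in clouds.yaml

# Look up the resources needed to launch a worker node for a processing pipeline.
image = conn.compute.find_image("ubuntu-16.04")
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("sdp-internal")

# Create the server and wait until it becomes ACTIVE.
server = conn.compute.create_server(
    name="sdp-worker-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```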

The SDP consortium, through activities sponsored by the University of Cambridge, is actively involved both in the development of OpenStack and in its promotion within the OpenStack Foundation. This included talks by Rosie Bolton on the SKA, Stig Telfer on capturing the state of the art in High Performance Computing (HPC) and OpenStack, and Paul Calleja on the use of OpenStack in Research Computing.

Caption: University of Cambridge contributions to the Autumn 2016 OpenStack Foundation Summit (Rosie Bolton, Stig Telfer, Paul Calleja)

Conclusion

We have submitted this eNews update just before the end of a work sprint. This means that we'll shortly be reviewing our progress, improving our processes, and selecting our priorities for December-January. We'll also be reviewing our risks; we hope to find that we have made steady progress in reducing the highest risks, so that we can turn our attention to the next set in our march towards our 2018 Critical Design Review.