The SDP Kernel Workshop, held in Cambridge on 29 November 2016, allowed key partners from industry to discuss algorithmic efficiency with radio-astronomy experts, looking at the Gridding algorithm.

The kernel we have selected is gridding, because:

- The memory bandwidth and processing requirements are very demanding
- The data processed has domain specific characteristics
- It is affected by numerous parameters

We separate the work that should be undertaken into the following sections:

- Development of algorithms
- Study of optimal data layout for cache re-use and processing
- Alternative implementation of algorithms and data layout (e.g. sparse FFT)
- Organization of the parameter space
- Demonstrating the use of tools for profiling
- Reporting the behavior for implemented algorithms
- Description of the goals

**For more information please visit the GitHub repository on this work.**

## Frequently Asked Questions

A collection of clarifications to the work description

## What is the role of gridding in the pipeline?

Gridding is currently projected to be one of the most expensive operations. We estimate that at least ~2-4 Pflop/s of the SDP's compute capacity will have to be spent on gridding. What makes gridding especially interesting for the current pipeline planning is that it will become more important the more the telescope is scaled up, as naive scaling will lead to O(n^4)complexity in gridding and kernels. There will obviously be other factors, but it looks like a pretty safe bet that sooner or later gridding performance will become very important.

Where does this much compute come from? To give a sense of scale, a data set described above reflects:

- 45 seconds (x480 for 6 hours)
- 1 polarisation (x16 for all Mueller matrix terms)
- 7.5 MHz (x40 for full Low band)
- 1 facet (x25 for full FoV coverage to third null of antenna reception)
- 1 taylor term (x5 for recovering frequency dependence)
- 1 major loop (x10 for 10 self-calibration iterations)

So we need to process a data set like this roughly 384 million times before a full observation is processed. The data sets will vary in terms of visibilities and w/A-kernels, but the uv-grid distribution will mostly stay the same.

## What about weighting and deconvolution?

You might see these pop up in the reference implementation, but it is expected that those will not be relevant.

## Kernels switch fast. Do they have to be loaded from memory?

Visibility count per kernel depends on the baseline, and note that generally baselines with many visibilities switch kernels more slowly. However, it is correct that (A-)kernels get switched extremely fast. After all, they depend not only on baseline, but also on time and frequency, so we get at minimum ~40 kernel switches per baseline even ignoring w-kernels.

It seems highly desirable for this reason to do convolution on-the-fly at minimum. As argued in SDP Memo 028, this should be enough to ensure that the kernel is not memory-limited on kernels in most situations. There are also more advanced algorithms which might have advantages in special situations, but this should be a good starting point.

Generating w/A-kernels on-the-fly would be possible as well. However, note that generation is generally not *that* expensive in the grand scheme of things: It only needs to be done per individual antenna/w-value, whereas convolution needs to be done for every antenna-pair/w-value *combination*. For example, the A-kernel and w-kernel sets are roughly 300 MB combined here, but all required kernel combinations would be ~4TB!

## What is the purpose of the C gridding code in examples/grid?

This code was written as a test-run to see how hard it would be to work with the HDF5 data set from a C-like environment. The data access code has been tested and should be correct.

However, while the gridding portions has been used for benchmarks, it has not been tested as thoroughly. We have verified that the results are correct for simple and w-projection imaging, but for Aw-projection the program is missing actual convolution. So these portions of the program should be interpreted as illustrations, not as a reference.

## How can convolution be implemented using FFTs?

If you look at the definition of *aw_kernel_fn*, you will notice that it is implemented as follows:

akern = scipy.signal.convolve2d(a1kern, a2kern, mode='same')

Where *concolve2d *implements a "naive" convolution algorithm. However, using FFT laws we can replace this by Fourier transformation. If we want to implement the original convolution exactly this would boil down to:

awkern = 29*29*extract_mid(

numpy.fft.fftshift(numpy.fft.fft2(

numpy.fft.ifft2(numpy.fft.ifftshift(pad_mid(a1kern,29))) *

numpy.fft.ifft2(numpy.fft.ifftshift(pad_mid(a2kern,29))) )),

15)

Note that we need to pad the kernels to get the right border behaviour. However, the kernels for Aw-gridding will normally be chosen to /not/ overflow the borders (e.g. padding should have happened when the kernels get generated). Therefore, the computationally simpler "wrapping" approach is permitted as well:

awkern = 15*15*numpy.fft.fftshift(numpy.fft.fft2(

numpy.fft.ifft2(numpy.fft.ifftshift(a1kern)) *

numpy.fft.ifft2(numpy.fft.ifftshift(a2kern)) ))

Note that the inner FFTs of the input kernels are repeated quite often, and should be shared as much as possible to reduce computation.