Institute of Theoretical Informatics, Karlsruhe Institute of Technology (Germany)
Local Project ID:
HPC Platform used:
JUQUEEN of JSC
Researchers of the Institute of Theoritical Informatics at the Karlsruhe Institute of Technology (KIT) investigate sorting algorithms on supercomputers that scale to the largest available machines and are robust with respect to input size, degeneracy and skew in the input. The main outcome is that four sorting algorithms cover the entire range of possible input sizes. At the same time, theoretical analysis provides performance guarantees and guides the selection and configuration of the algorithms. These theoretical hypotheses are validated using extensive experiments on 7 algorithms, 10 input distributions, up to 262144 processing cores, and varying input sizes over 9 orders of magnitude. For “difficult” input distributions, the new algorithms are the only ones that work at all.
Sorting is one of the most fundamental and widely used non-numerical algorithm. It is used to build index data structures, to establish invariants for further processing, to bring together “similar” data, and for many other applications. In particular, it is an important tool for data rearrangement and load balancing on supercomputers. This wide variety of applications means that we need fast sorting algorithms for a wide spectrum of inputs with respect to data type, input size, or degeneracy and skew in the distribution of input elements.
The subject of the research at the KIT is to systematically explore the design space of parallel algorithms for massively parallel machines and propose robust algorithms for the entire spectrum of possible inputs. The authors consider three major issues with respect to the robustness of a massively parallel sorting algorithm: Its scalability, i.e., its running time as a function of the input size n and the number of processors p, the impact of skewed input distributions where the location of the input data is correlated with key values (skewed inputs), and the impact of repeatedly occurring keys (degeneracy).
The motivation of the research is that parallel algorithms currently do not cover this spectrum of possible inputs for very large parallel machines. Although hundreds of papers on parallel sorting have been published, there is only a small number of practical studies that consider the largest machines with many thousands of processors (PEs). The few known studies mostly concentrate on very large random inputs. Most of these algorithms become slow (or simply crash) when applied to worst case inputs such as highly skewed or degenerate inputs. For example, the recently proposed multi-level sorting algorithm HykSort crashes for some degenerate inputs and becomes slow for skewed inputs due to huge load imbalances.
Even for random inputs, these algorithms are slow on small inputs, where their running time is dominated by a large number of message startup overheads and a large memory footprint. Note that sorting small inputs becomes important for the performance of parallel applications when it is needed in some frequently repeated coordination step between the PEs. For example, many applications perform load (re)balancing by mapping objects to space filling curves and sorting them with respect to this ordering  (see Figure 1). The scalability of the sorting algorithm may then become the limiting factor for the number of time steps we can do per second. Note that in this case it is of secondary importance whether the sorting algorithm itself is efficient compared to a sequential one - what matters is that it is faster than the local computations between rebalancing steps. Another extreme example is the operation MPI_Comm_Split which requires sorting with exactly one input element per process.
Practical Robust Massively Parallel Sorting
The researchers at the KIT systematically explore the design space of parallel sorting algorithms. The main outcome are four sorting algorithms which cover the entire range of possible input sizes [1,2,3].
The first sorting algorithm sorts while data is routed to a single processor (Gather-merge). This algorithm is very simple and performs only a logarithmic number of communication operations on the critical path. Its disdvantage is that the receiving processor constitutes a bottleneck so that the algorithm only works for very small inputs.
The second algorithm - “fast work-inefficient sorting (FIS)” is slightly more complicated and performs about 50 % more communication operations on the critical path. It eliminates bottlenecks by exploiting all processors in a uniform way. It is still inefficient in the sense that it performs many more element comparisons than a sequential algorithm.
The third algorithm (RQuick) is a parallel variant of the widely used quicksort algorithm. The number of communications on the critical paths grows proportional to (log p)2. It is efficient with respect to element comparisons but still inefficient because all the data is moved between processors a logarithmic number of times. This algorithm performs best for a wide range of small to medium input sizes on large supercomputers.
The last algorithm, AMS-sort works well for large inputs. It is a generalization of quicksort that makes a careful compromise between the number of times data is moved (k) and the number of communications on the critical path (kp1/k).
In general, for small inputs, communication latency dominates the performance so that successful algorithms for that case trade off latency against local work and communication volume. For increasing input size, communication volume and local work begin to dominate the running time and algorithms with larger latency but less local work and communication volume become interesting.
The researchers also devise new low-overhead mechanisms to make the algorithms robust with respect to skewed and degenerate inputs. E.g., FIS, RQuick and AMS-sort employ low-overhead tie-breaking to remove degeneracy. RQuick uses randomization and a fast but accurate splitter selection algorithm to handle skewed inputs, AMS-sort applies advanced data routing techniques to guarantee low latency even for worst case input (see Figure ). Additionally, the authors propose a new load balancing method to reduce the additional memory footprint of AMS-sort and sample sort algorithms in general (see also ).
At the same time asymptotic analysis provides performance guarantees and guides the selection of the algorithms (see Figure ). These theoretical hypotheses are validated using extensive experiments (see Figure ).
Michael Axtmann, Sebastian Lamm, Lorenz Hübschle, Dr. Christian Schuld, Daniel Funke
 M. Axtmann, T. Bingmann, P. Sanders, and C. Schulz. Practical massively parallel sorting. In 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 13-23, 2015.
 M. Axtmann and P. Sanders. Robust massively parallel sorting. In 19th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 83-97, 2017.
 M. Axtmann, A. Wiebigke, and P. Sanders. Lightweight MPI communicators with applications to perfectly balanced quicksort. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 254-265, 2018.
 P. Sanders, S. Lamm, L. Hübschle-Schneider, E. Schrade, and C. Dachsbacher. Efficient parallel random sampling - vectorized, cache-efficient, and online. ACM Transactions on Mathematical Software (TOMS), Volume 33, No. 3, pages 29:1-29:14, 2018.
 M. von Looz, C. Tzovas, and H. Meyerhenke. Balanced k-means for parallel geometric partitioning. In Proceedings of the 47th International Conference on Parallel Processing (ICPP), pages 52:1-52:10, 2018.
Scientific Contact, Principal investigator:
Prof. Dr. Peter Sanders
Institute of Theoretical Informatics
Karlruhe Institute of Technology
Am Fasanengarten 5, D-76131 Karlsruhe (Germany)
Project ID: hka17