In June 2017, the German Federal Ministry of Education and Research (BMBF) announced its Smart Scale strategy, setting the stage for the next decade of funding for Tier-0 German supercomputing and supporting the three Gauss Centre for Supercomputing centres—the High-Performance Computing Center Stuttgart (HLRS), Jülich Supercomputing Centre (JSC), and Leibniz Supercomputing Centre (LRZ).
BMBF and the state ministries of Baden-Württemberg, Bavaria, and North Rhine-Westphalia will provide funding for the next ten years. In addition, Germany will benefit from the European Commission’s EuroHPC initiative, which, between industry, the European Union, and member states, will provide up to €1.5 billion to build exascale systems in Europe over the next five years.
The primary strength of GCS is rooted in its diversity. While providing access to researchers from many scientific disciplines, each centre has developed specializations rooted in staff expertise and complementary computing architectures. As Germany prepares for the pre-exascale era and the exascale horizon, the directors at the three GCS facilities have different focuses for their respective centres while working together to create a robust collaboration and knowledge exchange between the facilities. GCS sat down with the three centre directors to talk about their respective plans.
Prof. Dr. Dr. Thomas Lippert, Director of the Jülich Supercomputing Centre
• What are the advantages of JSC’s new modular supercomputer? How does it prepare JSC users for the kinds of challenges they will face as the HPC community moves toward exascale?
Modular supercomputing is a new paradigm that directly reflects, within the architecture of the supercomputer itself, the diversity of execution characteristics found in modern simulation codes. Instead of a homogeneous design, different modules, strongly connected via a homogeneous global software layer, enable optimal resource assignment. This allows us to reach greater scalability and significantly higher efficiency with lower energy consumption, addressing both big data analytics and exascale simulation capabilities. Our users will benefit from this architecture in the long term, although some initial work on their codes might be necessary, and we support our users in the best possible way.
• What is significant about the transition to a hybrid architecture? How does JSC anticipate leveraging accelerator technologies moving forward?
The modular supercomputer works like a turbocharger: a booster module accelerates calculations on a cluster module. In other words, complex parts of the code that are difficult to calculate simultaneously on a large number of processors are executed on the cluster module. Simpler parts of the program that can be processed in parallel with greater efficiency—meaning the parts that are scalable—are transferred to the booster module. The booster module uses a large number of relatively slow but energy-efficient cores. We are convinced that we can also implement an exascale-class supercomputer based on the modular concept.
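The division of labor Lippert describes can be illustrated with a small sketch. This is purely a toy example, not JSC software: the task list, the `scalable` flag, and the module names are hypothetical stand-ins for the profiling and scheduling a real modular system would perform.

```python
# Illustrative toy only (not actual JSC software): route highly scalable
# work to the "booster" module (many slow, energy-efficient cores) and
# complex, poorly scaling work to the "cluster" module (fewer, faster cores).

def assign_module(task):
    """Pick the module that suits a task's execution profile."""
    return "booster" if task["scalable"] else "cluster"

tasks = [
    {"name": "mesh_setup",    "scalable": False},  # complex, latency-bound
    {"name": "field_update",  "scalable": True},   # embarrassingly parallel
    {"name": "io_checkpoint", "scalable": False},  # serialized I/O
    {"name": "particle_push", "scalable": True},   # scales to many cores
]

schedule = {t["name"]: assign_module(t) for t in tasks}
print(schedule)
```

In a real modular system, of course, this decision is made by the global software layer based on measured code characteristics rather than a hand-set flag.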
• In what other ways is JSC supporting users during this transition? How has the focus of your support and training programs changed, and how do you see JSC staff and users more closely collaborating moving forward?
Even before we pursued the modular concept, we had already established a highly recognized support structure at the JSC. In particular, the SimLabs are firmly anchored in the respective research communities and work together with users to optimize their codes in order to use the system as efficiently as possible. In addition, we offer a broad portfolio of courses—some even tailored specifically to a community—that help researchers to make the most of using the architecture.
Prof. Dr. Dieter Kranzlmüller, Director of the Leibniz Supercomputing Centre
• The new SuperMUC-NG machine makes a major performance leap while maintaining a homogeneous, CPU-based architecture. How does this benefit the LRZ user community?
Traditionally, we at LRZ have a broad user base for our HPC systems—which is one of the reasons for the high demand for our computing resources from researchers in Germany and Europe. Hence, it was a priority for us to procure a system that will benefit many scientists. The general-purpose, CPU-based architecture, the high-performance Omni-Path 100 Gb network, and the network topology are just some of the hardware features that make this possible. In combination with a corresponding software stack, we allow for an efficient workflow for the development and optimization of highly scalable applications from different domains—be it astrophysics, geoscience, computational fluid dynamics, chemistry, or many others. At the same time, of course, SuperMUC-NG caters to the changing needs of scientists and has a lot to offer for researchers newer to the field of HPC, such as those from the life sciences or environmental research. The system is equipped to adequately address the data challenges we see more frequently now in HPC. The newly introduced cloud component will give researchers the greatest possible freedom to use their own software and visualization environments to process the data generated by the supercomputer and to share these results with others. However, such a major performance leap will still make it necessary for our users to further optimize, adapt, and scale their codes. LRZ experts will assist them, providing an interface between the various scientific communities and computer science. As part of the project, we will again extend our user support team.
• A major theme in next-generation supercomputing is energy efficiency, and LRZ has been a leader in energy efficient supercomputing. How did energy efficiency play into the procurement of SuperMUC-NG, and what efforts are in place to ensure that future HPC systems continue to lead in energy efficiency?
Developing energy-efficient supercomputing systems is a societal and environmental imperative as well as an economic necessity. Therefore, indeed, energy efficiency was—next to the usability of the system—a major priority in the procurement process. The requirements we put into the tender resulted in a system that features very innovative technology: SuperMUC-NG will be direct warm-water cooled and allow for nearly 90 percent heat recovery. Using adsorption chillers, SuperMUC-NG will transform the remaining heat energy back into cooling for networking and storage components. To run the adsorption chillers under the best conditions, operating temperatures will be even higher than those of our current-generation SuperMUC machine: up to 45°C for general deployments and up to 50°C for special projects. The machine relies on an advanced scheduling system as well as the Lenovo Energy Aware Runtime (EAR) software, a technology that dynamically optimizes system infrastructure power while applications are running.
• What aspects of hardware changes, human innovation, and knowledge transfer between users and LRZ staff are going to play the biggest roles for the next major increase in LRZ computing power?
When procuring, developing, and installing a system of such a size, the devil is in the details. In fact, all of the mentioned aspects are important factors. The key is to make sure everything follows a well-planned strategy and is thoroughly integrated. We also have a team of experts from different areas who collaborate closely during the run-up to and throughout the procurement process. There are quite a few things already happening now. For instance, we have started planning for the building infrastructure; as mentioned, our research efforts in energy efficiency and performance optimization are ongoing, as is the exchange between our users and staff. And, of course, we’re open to new technologies and will be bringing in a number of prototype systems before we start the official procurement process for the successor of SuperMUC-NG. This is our contribution to the German Smart Scale initiative and represents another step toward exascale-class systems.
Prof. Dr. Michael M. Resch, Director of the High-Performance Computing Center Stuttgart
• What are some of the most important characteristics for the next-generation HLRS machine?
As part of GCS, HLRS takes responsibility for the supply of HPC resources and support for engineering applications. We are looking at systems that can provide high sustained performance for engineering applications. Usually that implies a certain level of main memory speed, both in terms of latency and bandwidth. This also includes a high-speed, closely coupled communication network. Another factor is a stable, high-performance file system.
• HLRS closely collaborates with industry, ensuring supercomputing is accessible and fulfils the requirements of a wide variety of industries. How will HLRS ensure that increasingly large, complex HPC architectures are still usable by commercial and industrial partners?
First and foremost, HLRS has a focus on scientists and their needs for engineering applications. However, their needs coincide with the needs of many industrial applications. For industrial users there are two main problems with large-scale, complex computer systems. First, there is a need to get independent software vendors (ISVs) to help create software that scales well enough to make good use of HPC resources. We have partnered with ISVs in the past in order to get codes to scale much better and have had some impressive successes. Second, industrial users normally have little experience with such complex computer systems. Over the past several years, we have created a training program that tackles this problem. In a large-scale European project, we aim to extend the reach of this training program to include industrial users on a much wider scale than before.
• How are HLRS’s training and support programs preparing users this year for the transition to a new machine in 2019, and how will HLRS continue to improve the collaboration between its users and staff?
We have started to tailor our training program toward large-scale systems. This includes adding courses on new programming approaches such as CUDA and PGAS languages. It also includes programming special-purpose hardware. However, each system is so specialized that as soon as we have made a decision on our next-generation system, we will work together with the chosen vendor to provide our users with specialized training for the new system.
Prof. Resch, as current Chairman of the GCS Board of Directors, hosted a special session on Tuesday, June 26 at ISC18 in Frankfurt, Germany, titled “German HPC in Context,” which went into greater detail about the German HPC strategy.
-Eric Gedenk, firstname.lastname@example.org