High-performance computing (HPC) helps societies around the world develop new medicines, improve surgical outcomes, increase engine efficiency, design new green technologies, study the origins of our planet and solar system, and chart climate change, among countless other research breakthroughs.
As HPC resources have gotten bigger and faster through the years, HPC centres’ thirst for energy has also grown, and many of the world’s leading systems require enough energy to supply a small city. Over the last decade, that growing need for energy has compelled the three centres comprising the Gauss Centre for Supercomputing (GCS)—the High-Performance Computing Center Stuttgart (HLRS), Jülich Supercomputing Centre (JSC), and the Leibniz Supercomputing Centre (LRZ)—to develop innovative approaches to keeping energy usage down and making supercomputing as sustainable as possible.
While large facilities such as these may take a long time to become energy neutral, staff at all three centres have made significant improvements over the last few years and have plans for more. In recent years, the centres have received awards and international recognition for their commitment to sustainable supercomputing, but they are not satisfied. Each of the GCS centres approaches sustainability in its own way, but all three centres’ staffs have their eyes forward, looking for new ways to further reduce the centres’ carbon footprints and make the next generation of HPC more sustainable than the last.
Building a sustainable HPC centre requires more than just trying to lower the electric bill—GCS centres have been hard at work on large-scale, holistic approaches to improve efficiency through changes to computer architectures, cooling, building design, operations at the centre, application design, and waste heat reuse.
Many aspects of running an HPC centre require energy, but keeping systems cool is among the most challenging and expensive. As far back as 2008, LRZ staff became increasingly interested in how to more efficiently cool their supercomputer.
At that time, most large supercomputers used air cooling—that is, using fans and cool air to keep the computer cool. LRZ serves on the organizing committee of the European HPC Infrastructure Workshops, and staff participated in the first workshop in 2008, where they learned of early IBM research into new liquid cooling methods. Despite being near the end of the design phase for their next data centre, staff convinced leadership to make last-minute modifications to build a cooling infrastructure that, counterintuitively, uses warm water.
“There is a difference of three orders of magnitude between using water and air when it comes to cooling the machine,” said Dr. Herbert Huber, head of the high-performance systems division at LRZ. “We were able to move to the forefront of energy-efficient supercomputing because we were first adopters of this chillerless direct warm water cooling method, where the water can be up to 50 degrees Celsius, and thankfully had strong support from our leadership to make sure that this happened.”
In recent years, the other GCS centres have also adopted direct warm water cooling for their machines: HLRS uses it on its latest system, Hawk, which came fully online this year, and JSC is set to install warm water cooling infrastructure when the GPU-based booster module for its JUWELS supercomputer comes online later this autumn.
As part of the larger research centre Forschungszentrum Jülich (FZJ), JSC provides its resources to external users but also puts a lot of effort into computer architecture research and its implications for energy efficiency and sustainability. Before the Atos-ParTec-JSC-made system JUWELS came online, JSC worked closely with IBM on its prior generations of supercomputers. Because running compute cores at higher frequencies yields diminishing returns, these machines used a large number of relatively slow nodes, marking JSC’s shift towards highly scalable systems.
“What we learned with IBM was that if you lower the frequency, you are increasing efficiency,” said Dr. Thomas Eickermann, Group Leader of the communication systems division and Deputy Director at JSC. “If you double the frequency, performance will increase by less than two but the power will more than double.”
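The tradeoff Eickermann describes can be made concrete with a toy model (the numbers and scaling assumptions here are illustrative, not JSC’s measurements): dynamic CPU power scales roughly with voltage squared times frequency, and because voltage must rise with frequency, power grows roughly with the cube of the clock, while performance grows sub-linearly because memory-bound phases of an application do not speed up with the clock.

```python
# Toy model of the frequency/power tradeoff (illustrative only).

def power(freq_ghz, base_freq=2.0, base_watts=100.0):
    """Dynamic power under the common V ∝ f assumption, giving P ∝ f^3."""
    return base_watts * (freq_ghz / base_freq) ** 3

def performance(freq_ghz, base_freq=2.0, memory_bound=0.4):
    """Relative performance: only the compute-bound fraction of the
    workload speeds up with frequency (a simple Amdahl-style split)."""
    speedup = freq_ghz / base_freq
    return 1.0 / (memory_bound + (1.0 - memory_bound) / speedup)

base_f, doubled_f = 2.0, 4.0
perf_ratio = performance(doubled_f) / performance(base_f)
power_ratio = power(doubled_f) / power(base_f)

# Performance rises by less than 2x while power rises by far more,
# so energy per unit of work is worse at the higher frequency.
print(f"doubling frequency: {perf_ratio:.2f}x performance, "
      f"{power_ratio:.1f}x power")
```

Under these assumptions, doubling the clock buys well under twice the performance at eight times the dynamic power, which is exactly why a large number of slower nodes can be the more energy-efficient design.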
LRZ is following a different approach that relies on the same effect: in collaboration with LRZ, IBM developed an energy-aware scheduling method in which only applications that benefit from high processor frequencies are executed above a defined default frequency. “We pursue this strategy for our current and future HPC systems and have evolved an application performance and energy model which is able to predict the optimal processor frequency to minimize the energy applications consume to run to completion,” said Huber.
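A minimal sketch of the idea behind such energy-aware frequency selection (all function names, frequencies, and power figures here are invented for illustration, not LRZ’s actual model): predict runtime and power at each candidate frequency, then pick the frequency that minimises predicted energy-to-solution.

```python
# Illustrative energy-aware frequency selection (not LRZ's actual code).

def predict_runtime(freq_ghz, compute_bound_fraction):
    """Relative runtime: only the compute-bound part shrinks as the
    clock rises above the assumed 2.0 GHz default."""
    return (1 - compute_bound_fraction) + compute_bound_fraction * 2.0 / freq_ghz

def predict_power(freq_ghz, static_watts=150.0, dyn_watts=50.0):
    """Node power: a static part plus a dynamic part scaling ~ f^3."""
    return static_watts + dyn_watts * (freq_ghz / 2.0) ** 3

def optimal_frequency(compute_bound_fraction, candidates=(1.6, 2.0, 2.4, 2.8)):
    """Frequency minimising predicted energy = power × runtime."""
    return min(candidates,
               key=lambda f: predict_power(f) * predict_runtime(f, compute_bound_fraction))

# A memory-bound code gains little from high clocks, so a lower
# frequency minimises its energy-to-solution; a compute-bound code
# finishes enough faster to justify a higher clock.
print(optimal_frequency(0.2))   # mostly memory-bound
print(optimal_frequency(0.95))  # mostly compute-bound
```

With these assumed figures the memory-bound application is clocked down while the compute-bound one is clocked up, which is the behaviour the scheduling policy is after.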
In designing the novel modular computing architecture for its current machine, JUWELS, the centre not only kept this frequency issue in mind but also made two more architectural decisions with large implications for energy efficiency. For the Booster module specifically, JSC invested heavily in energy-efficient GPUs, giving the system a large performance boost for the least amount of energy possible: the first JUWELS module achieved 12 petaflops of peak performance using 1.1 megawatts, while the Booster raises peak performance to roughly 85 petaflops and requires only about 2 megawatts.
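A quick back-of-the-envelope check of those figures (peak numbers from the text; sustained efficiency in practice differs):

```python
# Peak-performance efficiency implied by the quoted figures.
# 1 petaflop = 1e6 gigaflops; power in watts.
cluster_gf_per_w = (12 * 1e6) / 1.1e6   # JUWELS Cluster: 12 PF at 1.1 MW
booster_gf_per_w = (85 * 1e6) / 2.0e6   # JUWELS Booster: 85 PF at ~2 MW

print(f"Cluster: {cluster_gf_per_w:.1f} GF/W")   # ~10.9 GF/W
print(f"Booster: {booster_gf_per_w:.1f} GF/W")   # ~42.5 GF/W
print(f"Improvement: {booster_gf_per_w / cluster_gf_per_w:.1f}x")
```

That is roughly a fourfold gain in peak flops per watt between the two modules.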
Further, the modular architecture itself allows the system to run, and in turn power, only the parts of the machine needed by the jobs on it: users with GPU-centric applications do not have to run on hybrid CPU-GPU nodes and leave the CPUs idling, and vice versa.
“Think of it as hiring people as part of a construction project,” Eickermann said. “A typical node might consist of 6 ‘decorators’ representing the CPUs and 2 ‘plumbers’ representing the GPUs. If I am building a museum, there is a lot of decoration to do, but relatively few bathrooms, so the plumbers will be idle most of the time. On the other hand, if I am building a hotel, there is a lot of plumbing work to do and relatively little decoration. Our modular concept ensures that just the right number of people of each profession are hired and so we are limiting the amount of time our resource has idle cores running.”
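The analogy maps onto allocation roughly like this toy accounting (job labels and hours are invented for illustration, not JSC’s actual scheduler): on fixed hybrid nodes every job occupies both CPUs and GPUs for its whole runtime, while a modular system routes each job to the module whose hardware it actually uses.

```python
# Toy accounting of idle hardware under hybrid vs. modular allocation.
# Jobs are (kind, node-hours); figures are invented for illustration.
jobs = [("cpu_only", 100), ("gpu_heavy", 40), ("cpu_only", 60)]

def idle_hours_hybrid(jobs):
    """On hybrid CPU-GPU nodes, a CPU-only job idles the node's GPUs for
    its whole runtime, and a GPU-heavy job idles the CPUs: one resource
    type sits idle for every job-hour."""
    return sum(hours for _, hours in jobs)

def idle_hours_modular(jobs):
    """On a modular system, CPU jobs run on the Cluster module and GPU
    jobs on the Booster, so no mismatched hardware idles inside a job."""
    return 0

print(idle_hours_hybrid(jobs), idle_hours_modular(jobs))
```

The toy numbers overstate the effect (real GPU jobs still use some CPU), but they capture why matching module to workload cuts wasted, powered-on hardware.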
A sustainable centre goes beyond just ensuring that a supercomputer and its associated infrastructure are efficient—it requires a centre-wide consciousness and plan that touches on all aspects of a centre’s operation. At HLRS, sustainability staff built a comprehensive plan to address a wide range of environmental issues.
Seeing a growing need for more sustainable supercomputing back in 2012, HLRS applied for funding through the Baden-Württemberg Ministry of Science and the Arts (MWK) to build out a formal sustainability team. The team surveyed the centre to identify areas for improvement, then implemented plans to address those issues. In the course of that work, HLRS decided to submit to rigorous audits that touched all aspects of the centre’s operations.
For several years, HLRS developed a plan to get all employees involved in sustainability efforts, modified its cooling infrastructure to limit the amount of biocide needed to prevent microbial growth (a common problem with chillers in HPC and other IT installations), and developed a comprehensive environmental mission statement for the centre.
These efforts were rewarded in 2019, as HLRS became the first large HPC centre to be certified under the Eco-Management and Audit Scheme (EMAS), a European-Union-developed system that is among the most demanding frameworks for organizational environmental management in the world. The centre also received certifications for environmental management under the ISO 14001 norm and for energy management under the ISO 50001 framework. These internationally recognized certifications attest that HLRS has taken organization-wide steps to lower its environmental impact and further improve sustainability across the centre.
“It was a challenge to take a project limited in people, time, and funding and turn it into something that touches the whole centre and also becomes a continuous effort,” said Dr. Norbert Conrad, Deputy Director at HLRS. “We are proud of our accomplishments with respect to these certifications, as an independent expert confirmed that we are a sustainable centre according to internationally recognized rules,” he said. “We intend to continue to educate our staff on all the ways to make sure HLRS continues to operate as efficiently and sustainably as possible in the coming years.” HLRS recently published its “Practical Guide for Sustainability in HPC Centres” and was officially certified under Blauer Engel (Blue Angel), the German federal government’s ecolabel.
Finally, all three centres committed to getting electricity from renewable sources—both LRZ and HLRS use 100 percent renewable energy to power their organizations, and JSC sources more than half of its energy from renewable sources.
Environmental excellence in years to come
All three centres pointed to Germany’s relatively high energy costs and the overall heightened public awareness of sustainability as primary factors in Germany being a leader in sustainable supercomputing. While GCS centres are among the most energy-efficient supercomputing centres in the world, all three centres know that there are many opportunities to improve.
In anticipation of its next-generation machine, LRZ would like to expand its warm water cooling approach beyond the supercomputer itself to the network switches, servers, and other infrastructure in the computing room. Not only will this further lower energy consumption in LRZ’s building, but the lack of blowers and fans will also make the room almost completely quiet; anyone who has toured a supercomputing facility knows that computer rooms are usually anything but.
All three centres are also exploring plans and funding to see how the hot water produced in the cooling process can be reused in other ways.
LRZ uses the hot water to heat its buildings in winter and to operate adsorption chillers that produce cooling water for storage and other air-cooled IT components in summer. Adsorption chillers work much like the human body cooling off when it is too hot: silica gel captures water vapor while water absorbs heat coming off the computers’ racks. The water can reach up to 56 degrees Celsius, heating the silica gel and releasing the water vapor trapped inside. Much like how we sweat when we get too hot, an evaporation process then produces cooling water, saving the large amounts of energy normally needed to run compressor chillers or to constantly run fans.
HLRS uses the hot water to heat rooms in all of its buildings, and is in intensive discussions with the University of Stuttgart on how that water could be piped to other parts of the campus to heat other buildings. As part of the Living Lab Energy Campus, a partnership between FZJ and the RWTH Aachen technical university, JSC plans to use its hot water to heat other parts of the FZJ campus and to further develop and deploy intelligent energy management strategies, covering the full chain of electrical and thermal energy production, storage, distribution, and usage.
The centres are active in sustainable HPC infrastructure workshops that include not only the GCS centres, but other HPC centres in Germany and other parts of Europe. These conferences allow sustainability experts to come together and share best practices on the many different aspects of environmentally conscious supercomputing.
Of the countless aspects pertaining to sustainability, though, Eickermann underscored one of the most important guiding principles for JSC, and indeed all three GCS centres—flexibility. “From today’s perspective, machines’ densities are increasing, and warm water cooling is gaining more ground, but if you look back more than 15 years when we built our last building, air-cooled systems seemed to be the future,” he said. “That means that if and when we build something new, we need to have a building and infrastructure that can be flexible and respond to changes.”