
Training course "From zero to hero, Part I: Understanding and fixing on-core performance bottlenecks" @ JSC

Begin: 14 May 2019, 09:00
End: 15 May 2019, 16:30
Venue: JSC, Jülich

Modern HPC hardware offers many advanced, but not easily accessible, features that contribute significantly to overall intra-node performance. However, many compute-bound HPC applications have historically grown to simply use more cores and were never designed to exploit these features.

To make things worse, modern compilers cannot generate fully vectorized code automatically unless the data structures and dependencies are very simple. As a consequence, such applications reach only a small fraction of the available peak performance. Scientists therefore carry the additional responsibility of designing generic data layouts and data access patterns. This gives the compiler a fighting chance to generate code that exploits most of the available hardware features. Such layouts and access patterns are vital for extracting performance from vectorization (SIMD).
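As a toy illustration (not taken from the course materials), a loop over contiguous arrays with unit-stride access and no cross-iteration dependencies is exactly the kind of pattern a compiler can auto-vectorize:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: out = a * x + y over contiguous arrays.
// Flat, unit-stride access and independent iterations let the
// compiler generate SIMD code (e.g. at -O3) without any hints.
void axpy(float a, const std::vector<float>& x,
          const std::vector<float>& y, std::vector<float>& out) {
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = a * x[i] + y[i];
}
```

Had the data been scattered across pointer-linked structures instead, the compiler would typically give up on vectorization.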

Generic algorithms like FFTs or basic linear algebra can be accelerated by using 3rd-party libraries and tools especially tuned and optimized for a multitude of different hardware configurations. But what happens if your problem does not fall into this category and 3rd-party libraries are not available? This training course sheds some light on how to achieve on-core performance in that case.

We provide insights into today's CPU microarchitecture and apply this knowledge in the hands-on sessions. As example applications we use a plain vector reduction and a simple Coulomb solver. We start from basic implementations and advance to optimized versions using hardware features such as vectorization, unrolling and cache tiling to increase performance. The course also contains training on the use of open-source tools to measure and understand the achieved performance results.
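To give a flavour of this progression, here is a sketch (assumptions of ours, not the actual course code) of a plain sum reduction next to a version unrolled with four independent accumulators, so the CPU can overlap additions via instruction-level parallelism:

```cpp
#include <cstddef>
#include <vector>

// Baseline reduction: each addition depends on the previous one,
// forming a serial dependency chain through the accumulator.
double sum_naive(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Unrolled reduction: four independent accumulators break the
// dependency chain, exposing instruction-level parallelism.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];  // remainder loop
    return (s0 + s1) + (s2 + s3);
}
```

Note that for floating-point data the two versions may round differently, since addition order changes; the course discusses trade-offs like this in the hands-on sessions.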

Covered topics:

  • Inside a CPU: A scientist's view on modern CPU microarchitecture
  • Data structures: When to use SoA, AoS and AoSoA
  • Vectorization: SIMD on JURECA, JURECA Booster and JUWELS
  • Unrolling: Loop-unrolling for out-of-order execution and instruction-level parallelism
  • Data Reuse: Register file and cache tiling
  • Compiler: When and how to use compiler optimization flags
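For the data-structure topic, the contrast between the two basic layouts can be sketched as follows (illustrative definitions of ours, not the course's code):

```cpp
#include <vector>

// Array of Structures (AoS): natural to write, but loading only the
// x coordinates of many particles also strides over y, z and q,
// wasting cache bandwidth and hindering SIMD loads.
struct ParticleAoS { double x, y, z, q; };
using ParticlesAoS = std::vector<ParticleAoS>;

// Structure of Arrays (SoA): each component is contiguous, so one
// SIMD load fills a vector register with the x values of several
// consecutive particles.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};
```

AoSoA is a compromise between the two: small fixed-size AoS tiles whose members are themselves short arrays, combining SIMD-friendly inner layout with cache-friendly locality.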

This course is for you if you ever asked yourself one of the following questions:

  • What is the performance of my code and how fast could it actually be?
  • Why is my performance so bad?
  • Does my code use SIMD?
  • Why does my code not use SIMD and why does the compiler not help me?
  • Is my data structure optimal for this architecture?
  • Do I need to redo everything for the next machine?
  • Why is it this complicated? I thought science was the hard part!

The course consists of lectures and hands-on sessions. After each topic is presented, the participants can apply the knowledge right away in the hands-on training. The C++ code examples are generic and advance step by step. Even if you do not speak C++, it will be possible to follow along and understand the underlying concepts.

In Part II of the course you will learn how to utilize these features in a performance-portable way on multiple cores of a node. Furthermore, we will show how to use abstraction layers to separate the hardware-specific optimizations from the algorithm.