Success Story: Full Airplane Simulations on
Heterogeneous Architectures

Success story highlights:

  • Keywords:
    • Dynamic Load Balancing/DLB
    • parallel performance
    • airplane simulations
    • GPU computing
    • Co-execution
    • heterogeneous computing
  • Industry sector: Aeronautics
  • Key codes used: Alya
Snapshot of Q iso-surfaces of the turbulent flow around an airplane.

Organisations & Codes Involved:

Barcelona Supercomputing Center-Centro Nacional de Supercomputación (BSC-CNS) is the national supercomputing centre in Spain. BSC specialises in high performance computing (HPC) and manages MareNostrum IV, one of the most powerful supercomputers in Europe. BSC is at the service of the international scientific community and of industry that requires HPC resources. Its multidisciplinary research team and computational facilities – including MareNostrum – make BSC an international centre of excellence in e-Science.

Alya is a high performance computational mechanics code for solving complex coupled multi-physics / multi-scale / multi-domain problems, most of which come from the engineering realm. The physics solved by Alya include incompressible/compressible flows, non-linear solid mechanics, chemistry, particle transport, multiphase problems, heat transfer, turbulence modeling, electrical propagation, etc. Alya is one of only two CFD codes in the Unified European Applications Benchmark Suite (UEABS) and is also part of the PRACE Accelerator Benchmark Suite.

Scientific Challenge:

Many of the future Exascale systems will be heterogeneous and include accelerators such as GPUs. With the explosion of parallelism, we also expect the performance of the various computing devices to be more variable and, therefore, the performance of the system components to be less predictable. Leading-edge engineering simulation codes need to be malleable enough to adapt to this new environment. Alya, EXCELLERAT’s reference code for modelling complex systems such as full airplanes, therefore requires dynamic load balancing mechanisms that adjust the workload distribution to the measured performance of each component of the system.

Solution:

In EXCELLERAT we use dynamic load balancing (DLB) to increase the parallel efficiency of airplane simulations by minimising the idle time of underloaded devices at synchronisation points. Alya has been equipped with a distributed-memory DLB mechanism that complements the node-level parallel performance strategy already in place. The core components of the method are an efficient in-house Space Filling Curve (SFC)-based mesh partitioner and an online redistribution module that migrates the simulation from one partition to another. Together they are used to correct the partition according to runtime measurements. We have focused on maximising the parallel performance of the mesh partitioning process in order to minimise the load balancing overhead.
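The idea of cutting a space-filling curve into contiguous chunks and then correcting the cut from runtime measurements can be illustrated with a minimal Python sketch. This is not Alya's implementation: it assumes a 2D Morton (Z-order) curve over point coordinates instead of a 3D mesh, runs serially, and the names `morton_key`, `sfc_partition` and `measured_speeds` are hypothetical.

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of two integer grid coordinates (Z-order curve)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)        # x bits go to even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return key


def sfc_partition(points, weights, capacities, bits=16):
    """Assign each element to a rank by cutting the curve into contiguous chunks.

    points     : list of (x, y) coordinates in [0, 1)
    weights    : estimated cost of each element
    capacities : relative speed of each rank (illustrative values)
    """
    n = 1 << bits
    # Order the elements along the space-filling curve.
    order = sorted(
        range(len(points)),
        key=lambda i: morton_key(int(points[i][0] * n), int(points[i][1] * n), bits),
    )
    total_w = sum(weights)
    total_c = sum(capacities)
    # Cumulative load each rank should have reached at the end of its chunk.
    bounds, acc = [], 0.0
    for c in capacities:
        acc += c
        bounds.append(total_w * acc / total_c)
    owner = [0] * len(points)
    rank, running = 0, 0.0
    for i in order:
        mid = running + 0.5 * weights[i]  # place the element by its midpoint
        while rank < len(capacities) - 1 and mid > bounds[rank]:
            rank += 1
        owner[i] = rank
        running += weights[i]
    return owner


def measured_speeds(owner, weights, elapsed):
    """Estimate each rank's speed as (assigned load) / (measured elapsed time)."""
    loads = [0.0] * len(elapsed)
    for rank, w in zip(owner, weights):
        loads[rank] += w
    return [load / t for load, t in zip(loads, elapsed)]
```

In a usage scenario, one would partition with equal capacities, time one or more steps on each rank, pass the timings to `measured_speeds`, and call `sfc_partition` again with the resulting capacities. Elements then shift toward the ranks that finished sooner, and because each chunk stays contiguous along the curve, the new partition is a shifted version of the old one.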

Scientific impact of this result:

The EXCELLERAT software, based on the above-mentioned SFC method, can partition a 250-million-element mesh of an airplane within 0.08 seconds using 128 nodes (6,144 CPU cores) of the MareNostrum IV supercomputer. Consequently, mesh partitions can be recomputed at runtime for load balancing without producing a significant overhead. This approach was applied to perform full airplane simulations on the heterogeneous POWER9 cluster installed at the Barcelona Supercomputing Center, where we demonstrated a well-balanced co-execution using both the CPUs and GPUs simultaneously.

As a result, we obtained a 23% time reduction with respect to the GPU-only execution. In practice, this represents a performance boost equivalent to attaching an additional GPU per node and thus a much more efficient exploitation of the resources.
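The relation between the 23% time reduction and "an additional GPU per node" can be checked with a short calculation. The sketch below is only illustrative: the number of GPUs per node (4) and ideal scaling with GPU count are assumptions not stated in this story, and the function names are hypothetical.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional reduction in elapsed time into a speedup factor."""
    return 1.0 / (1.0 - reduction)


def ideal_speedup_extra_gpu(gpus_per_node):
    """Speedup from one more GPU per node, assuming ideal scaling (assumption)."""
    return (gpus_per_node + 1) / gpus_per_node


# 23% less time than the GPU-only run, as reported above.
coexecution_speedup = speedup_from_time_reduction(0.23)

# Assumed node configuration: 4 GPUs per node (not stated in this story).
extra_gpu_speedup = ideal_speedup_extra_gpu(4)

print(f"co-execution speedup: {coexecution_speedup:.2f}x")
print(f"ideal speedup from a 5th GPU: {extra_gpu_speedup:.2f}x")
```

Under these assumptions the co-execution speedup is of the same order as what an extra GPU per node would ideally deliver, which is consistent with the statement above.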

Benefits for further research:

  • Well-balanced co-execution using both the CPUs and GPUs simultaneously
  • 23% faster than using only the GPUs
  • Performance boost equivalent to attaching an additional GPU per node
  • Increased resilience of the software to system performance variability
Comparison of (balanced) co-execution vs. pure GPU execution – elapsed time per MPI Rank.

References: R. Borrell, D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vázquez, G. Oyarzun, Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: application to airplane aerodynamics, Future Gener. Comput. Syst. 107 (2020) 31–48, doi:10.1016/j.future.2020.01.045.

Any questions related to this success story? Please contact Ricard Borrell from BSC.