Over the last few years, industrial use of highly scalable computing systems (supercomputers) has expanded from an exclusive group of customers who could afford to build up in-house expertise to a broader customer base that previously had little or no experience in using these resources. As a result, high-performance computing centres that provide access to these systems have moved from "bare-metal" provisioning towards offering complete solutions, including application optimisation, testing, and access to knowledge. An important industrial community that already uses High Performance Computing (HPC) in its product cycles as well as in research and development is the engineering sector. To continue supporting this branch of industry as well as possible, easy access is needed to relevant services and knowledge sources that can address the respective questions and problems in a targeted manner. To realise this access, a working group of European high-performance computing centres was founded three years ago; it laid the groundwork for a research and development project, starting in December, to establish a Centre of Excellence (CoE) for engineering: the EXCELLERAT project.
Many engineering applications require complex frameworks to simulate the intricate and extensive sub-problems involved. However, performance analysis tools can struggle when the complexity of the application frameworks increases. In this paper, we share our efforts and experiences in analyzing the performance of CODA, a CFD solver for aircraft aerodynamics developed by DLR, ONERA, and Airbus, which is part of a larger framework for multi-disciplinary analysis in aircraft design. CODA is one of the key next-generation engineering applications represented in the European Centre of Excellence for Engineering Applications (EXCELLERAT). The solver features innovative algorithms and advanced software technology concepts dedicated to HPC. It is implemented in Python and C++ and uses multi-level parallelization via MPI or GASPI and OpenMP. We present, from an engineering perspective, the state of the art in performance analysis tools, discuss the demands and challenges, and present first results of the performance analysis of a CODA performance test case.
Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines. One of the key bottlenecks for these methods is sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases. We present our work on improving the strong scalability limits of general low-order finite element solvers based on message passing. Using lightweight one-sided communication, we demonstrate that performance-critical, latency-sensitive kernels can scale almost an order of magnitude better. We introduce a new hybrid MPI/PGAS implementation of the open-source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in UPC. A detailed description of the implementation and the hybrid interface to FEniCS is given, and we present a detailed performance study of the hybrid implementation on Cray XC40 machines.
EXCELLERAT is a European Centre of Excellence for engineering applications, established and funded by the European Union within the Horizon 2020 programme. The Centre is an initiative of several European high-performance computing centres whose goal is to support key engineering industries in Europe in handling complex applications that use High Performance Computing technologies. The project ultimately aims at a better exploitation of the scientific progress of HPC-driven engineering and at addressing current economic and societal challenges at the European level in a coherent way. CINECA is one of the proposing centres, will be one of the first three centres in Europe to host a pre-exascale computer, and is involved in two application use cases with interesting forward-looking characteristics. In this article we briefly describe EXCELLERAT and go into detail on the two selected practical applications.
Modern supercomputers allow the simulation of complex phenomena with increased accuracy. In turn, this requires finer geometric discretizations with larger numbers of mesh elements. In this context, and extrapolating to the Exascale paradigm, meshing operations such as generation, adaptation, or partitioning become a critical bottleneck within the simulation workflow. In this paper, we focus on mesh partitioning. In particular, we present some improvements carried out on an in-house parallel mesh partitioner based on the Hilbert Space-Filling Curve.
Additionally, taking advantage of its performance, we present the application of the SFC-based partitioning for dynamic load balancing. This method is based on the direct monitoring of the imbalance at runtime and the subsequent re-partitioning of the mesh. The target weights for the optimized partitions are evaluated using a least-squares approximation considering all measurements from previous iterations. In this way, the final partition corresponds to the average performance of the computing devices engaged.
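The idea of partitioning along a Hilbert space-filling curve can be illustrated with a minimal serial sketch. It is not the in-house partitioner described above: the 2D integer grid, the function names `hilbert_index` and `partition_by_sfc`, and the equal target weights are all assumptions made for illustration. Elements are ordered by their Hilbert index and the ordered sequence is cut into contiguous chunks of roughly equal weight.

```python
def hilbert_index(order, x, y):
    """Map integer cell (x, y) on a 2**order x 2**order grid to its Hilbert curve index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def partition_by_sfc(points, weights, nparts, order):
    """Assign each point to one of nparts contiguous chunks of the Hilbert ordering.

    Hypothetical helper: all partitions get the same target weight.
    """
    keys = [hilbert_index(order, x, y) for x, y in points]
    ordering = sorted(range(len(points)), key=keys.__getitem__)
    total = float(sum(weights))
    part = [0] * len(points)
    acc = 0.0
    for i in ordering:
        # Use the midpoint of the element's weight interval along the curve.
        mid = acc + 0.5 * weights[i]
        part[i] = min(int(mid / total * nparts), nparts - 1)
        acc += weights[i]
    return part


# Example: a 4 x 4 grid of unit-weight cells split into 4 partitions.
cells = [(x, y) for x in range(4) for y in range(4)]
parts = partition_by_sfc(cells, [1.0] * len(cells), nparts=4, order=2)
print(parts)
```

For dynamic load balancing, the equal targets in this sketch would be replaced by per-partition target weights derived from runtime measurements, as described in the abstract.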
We investigate how the accuracy and certainty of the quantities of interest (QoIs) of canonical wall-bounded turbulent flows are sensitive to various numerical parameters and time averaging. The scale-resolving simulations are performed by Nek5000, an open-source high-order spectral-element code. Different uncertainty quantification (UQ) techniques are utilized in the study. Using non-intrusive polynomial chaos expansion, portraits of error in the QoIs are constructed in the parameter space. The uncertain parameters are taken to be the grid spacing in different directions and the filtering parameters. As a complement to the UQ forward problems, global sensitivity analyses are performed with the results being quantified in the form of Sobol indices. Employing Bayesian optimization based on Gaussian Processes, the possibility of finding optimal combinations of parameters for obtaining QoIs with a given target accuracy is studied. To estimate the uncertainty due to time averaging, the use of different techniques such as classical, batch-based and autoregressive methods is discussed and suggestions are given on how to efficiently integrate such techniques in large-scale simulations. Comparisons of the certainty aspects between high-order and low-order codes (OpenFOAM) are given.
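As a rough illustration of a batch-based estimator for time-averaging uncertainty, the sketch below applies non-overlapping batch means to a synthetic correlated signal. The AR(1) test signal, the function names, and the choice of 20 batches are assumptions for illustration only; they do not reproduce the study's setup or its Nek5000 data.

```python
import math
import random
import statistics


def batch_means(samples, n_batches):
    """Return (mean, standard error) of a time average via non-overlapping batch means."""
    if n_batches < 2:
        raise ValueError("need at least two batches")
    size = len(samples) // n_batches
    if size < 1:
        raise ValueError("not enough samples for the requested number of batches")
    # Trailing samples that do not fill a whole batch are ignored.
    means = [statistics.fmean(samples[k * size:(k + 1) * size])
             for k in range(n_batches)]
    mean = statistics.fmean(means)
    std_err = statistics.stdev(means) / math.sqrt(n_batches)
    return mean, std_err


def ar1_signal(n, phi, seed=0):
    """Synthetic zero-mean correlated signal: x[t+1] = phi * x[t] + Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


signal = ar1_signal(20000, phi=0.9)
mean, se_batch = batch_means(signal, n_batches=20)
# Naive estimate that wrongly treats all samples as independent.
se_naive = statistics.stdev(signal) / math.sqrt(len(signal))
print(f"mean={mean:.4f}  batch SE={se_batch:.4f}  naive SE={se_naive:.4f}")
```

For a positively correlated signal the naive estimate is too optimistic, which is why batch-based or autoregressive estimators are preferred for turbulence statistics; the batch length must be long compared with the correlation time for the estimate to be reliable.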
We investigate the aerodynamic performance of active flow control of airfoils and wings using synthetic jets with zero net-mass flow. The study is conducted via wall-resolved and wall-modeled large-eddy simulation using two independent CFD solvers: Alya, a finite-element-based solver; and charLES, a finite-volume-based solver. Our approach is first validated on a NACA4412 airfoil, for which numerical and experimental results are already available in the literature. The performance of synthetic jets is evaluated for two flow configurations: an SD7003 airfoil at moderate Reynolds number with a laminar separation bubble, which is representative of Micro Air Vehicles, and the high-lift configuration of the JAXA Standard Model at realistic Reynolds numbers for landing. In both cases, our predictions indicate that, at high angles of attack, the control successfully eliminates the laminar/turbulent recirculations located downstream of the actuator, which increases the aerodynamic performance. Our efforts illustrate the technology readiness of large-eddy simulation in the design of control strategies for real-world external aerodynamic applications.
The use of Field Programmable Gate Arrays (FPGAs) to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. With the recent developments in FPGA programming technology, the ability to port kernels is becoming far more accessible. However, to gain reasonable performance from this technology it is not enough to simply transfer a code onto the FPGA; instead, the algorithm must be rethought and recast in a data-flow style to suit the target architecture. In this paper we describe the porting, via HLS, of one of the most computationally intensive kernels of the Met Office NERC Cloud model (MONC), an atmospheric model used by climate and weather researchers, onto an FPGA. We describe in detail the steps taken to adapt the algorithm to make it suitable for the architecture and the impact this has on kernel performance. Using a PCIe-mounted FPGA with on-board DRAM, we consider the integration of this kernel within a larger infrastructure and explore the performance characteristics of our approach in contrast to Intel CPUs that are popular in modern HPC machines, over problem sizes involving very large grids. The result of this work is an experience report detailing the challenges faced and lessons learnt in porting this complex computational kernel to FPGAs, as well as exploring the role that FPGAs can play and their fundamental limits in accelerating traditional HPC workloads.
This chapter will present the European approach for establishing Centres of Excellence in High-Performance Computing (HPC) applications, ensuring best synergies between participants in the different European countries. Those Centres are user-centric and thus driven by the needs of the respective community stakeholders. Within this chapter, the focus will lie on the respective activity for the Engineering community. It will describe what the aims and goals of such a Centre of Excellence are, how it is realized, and what challenges need to be addressed to establish an activity with long-term impact in Europe.
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The fluid necessary to actuate on the boundary layer is intermittently injected through an orifice and is driven by the motion of a diaphragm located on a sealed cavity below the surface.
High-fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which become progressively more acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The performance of all the proposed strategies has been assessed for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
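The abstract does not spell out the load balancing mechanism, so the following is only a generic sketch of how work might be re-split between co-executing CPU and GPU instances from measured timings. The function `rebalance`, the linear cost model, and the relative device speeds are hypothetical and are not taken from Alya.

```python
def rebalance(shares, times):
    """Return new work shares proportional to each device's measured throughput.

    shares: fraction of the mesh currently assigned to each device (sums to 1).
    times:  wall-clock time each device needed for its share in the last step.
    """
    rates = [s / t for s, t in zip(shares, times)]
    total = sum(rates)
    return [r / total for r in rates]


# Hypothetical devices: the second one processes elements 4x faster than the first.
speeds = [1.0, 4.0]
shares = [0.5, 0.5]
for step in range(3):
    # Assumed linear cost model: time grows in proportion to the assigned share.
    times = [s / v for s, v in zip(shares, speeds)]
    print(f"step {step}: shares={shares} times={times}")
    shares = rebalance(shares, times)
```

Under this idealised linear model the split settles after one update; with real measurements the timings are noisy, which is one reason to combine several iterations before re-partitioning.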
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The present paper focuses on the role of the periodic actuation mechanisms on the boundary layer of an SD7003 airfoil at a Reynolds number defined in terms of the free-stream velocity U∞ and the airfoil chord C. The actuation is applied near the leading edge of the airfoil and is periodic in time and in the spanwise direction. The actuation successfully eliminates the laminar bubble at , however, it does not produce an increase in the airfoil aerodynamic efficiency. At angles of attack larger than the point of maximum lift, the actuation eliminates the massive flow separation, and the flow remains attached to the airfoil surface over a significant part of the airfoil chord. As a consequence, the airfoil aerodynamic efficiency increases by 124%, with a reduction of the drag coefficient of about 46%. This kind of technique seems promising for delaying flow separation and its associated losses when the angle of attack increases beyond the maximum lift for the baseline case.
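The two quoted percentages can be combined in a short back-of-the-envelope check. It assumes that the aerodynamic efficiency is the lift-to-drag ratio, $E = C_L / C_D$, and that both figures refer to the same angle of attack and the same baseline; neither assumption is stated in the abstract, so the resulting lift change is only an estimate.

```latex
\[
\frac{E^{\mathrm{act}}}{E^{\mathrm{base}}} \approx 1 + 1.24 = 2.24,
\qquad
\frac{C_D^{\mathrm{act}}}{C_D^{\mathrm{base}}} \approx 1 - 0.46 = 0.54
\]
\[
\frac{C_L^{\mathrm{act}}}{C_L^{\mathrm{base}}}
= \frac{E^{\mathrm{act}}}{E^{\mathrm{base}}}
  \cdot \frac{C_D^{\mathrm{act}}}{C_D^{\mathrm{base}}}
\approx 2.24 \times 0.54 \approx 1.21
\]
```

Under these assumptions, most of the efficiency gain comes from the drag reduction, with the lift coefficient rising by roughly 21%.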
The use of reconfigurable computing, and FPGAs in particular, has strong potential in the field of High Performance Computing (HPC). However, the traditionally high barrier to entry in programming this technology has, until now, precluded widespread adoption. To popularise reconfigurable computing with communities such as HPC, Xilinx have recently released the first version of Vitis, a platform aimed at making the programming of FPGAs much more a question of software development than of hardware design. However, a key question is how well this technology fulfils that aim, and whether the tooling is mature enough that software developers using FPGAs to accelerate their codes is now a more realistic proposition, or whether it simply increases convenience for existing experts. To examine this question, we use the Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing, and optimising HPC codes, describing the different steps and potential pitfalls of the technology. The outcome of this exploration is a demonstration that, whilst Vitis is an excellent step forwards and significantly lowers the barrier to entry in developing codes for FPGAs, it is not a silver bullet, and an underlying understanding of dataflow-style algorithmic design and an appreciation of the architecture are still key to obtaining good performance on reconfigurable architectures.
Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines. Key bottlenecks for these methods are sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases, and linear solvers, where efficient overlapping is necessary to amortize the communication and synchronization cost of sparse matrix-vector multiplication and dot products. We present our work on improving the strong scalability limits of general low-order finite element solvers based on message passing. Using the lightweight one-sided communication offered by partitioned global address space (PGAS) languages, we demonstrate that performance-critical, latency-sensitive sparse matrix assembly can scale almost an order of magnitude better. Linear solvers are also addressed via a signaling put algorithm for low-cost point-to-point synchronization, achieving performance similar to that of message-passing-based linear solvers. We introduce a new hybrid MPI+PGAS implementation of the open-source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in Unified Parallel C (UPC). A detailed description of the implementation and the hybrid interface to FEniCS is given, and the feasibility of the approach is demonstrated via a performance study of the hybrid implementation on Cray XC40 machines.
The A64FX processor from Fujitsu, being designed for computational simulation and machine learning applications, has the potential for unprecedented performance in HPC systems. In this paper, we evaluate the A64FX by benchmarking against a range of production HPC platforms that cover a number of processor technologies. We investigate the performance of complex scientific applications across multiple nodes, as well as single node and mini-kernel benchmarks. This paper finds that the performance of the A64FX processor across our chosen benchmarks often significantly exceeds other platforms, even without specific application optimisations for the processor instruction set or hardware. However, this is not true for all the benchmarks we have undertaken. Furthermore, the specific configuration of applications can have an impact on the runtime and performance experienced.