In recent years, industrial use of highly scalable computing systems (supercomputers) has evolved from an exclusive group of customers who could afford to build up in-house expertise to a broader group of customers who previously had little or no experience in handling these resources. As a result, high-performance computing centres that offer access to these systems have moved from "bare-metal" provisioning to complete solutions, including application optimisation, testing, and access to knowledge. An important industrial community that already uses High Performance Computing (HPC) in its product cycles, as well as in research and development, is found in engineering. To continue supporting this economic sector as well as possible, simple access is needed to relevant services and knowledge sources that can address the respective questions and problems in a targeted way. To realise this access, a working group of European high-performance computing centres was established three years ago. It laid the groundwork for a research and development project, starting in December, to build a Centre of Excellence (CoE) for engineering: the EXCELLERAT project.
Many engineering applications require complex frameworks to simulate the intricate and extensive sub-problems involved. However, performance analysis tools can struggle when the complexity of the application frameworks increases. In this paper, we share our efforts and experiences in analyzing the performance of CODA, a CFD solver for aircraft aerodynamics developed by DLR, ONERA, and Airbus, which is part of a larger framework for multi-disciplinary analysis in aircraft design. CODA is one of the key next-generation engineering applications represented in the European Centre of Excellence for Engineering Applications (EXCELLERAT). The solver features innovative algorithms and advanced software technology concepts dedicated to HPC. It is implemented in Python and C++ and uses multi-level parallelization via MPI or GASPI and OpenMP. We present, from an engineering perspective, the state of the art in performance analysis tools, discuss the demands and challenges, and present first results of the performance analysis of a CODA performance test case.
Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines. One of the key bottlenecks for these methods is sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases. We present our work on improving the strong scalability limits of message-passing-based general low-order finite element solvers. Using lightweight one-sided communication, we demonstrate that performance-critical, latency-sensitive kernels can scale almost an order of magnitude better. We introduce a new hybrid MPI/PGAS implementation of the open-source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in UPC. A detailed description of the implementation and the hybrid interface to FEniCS is given, and we present a detailed performance study of the hybrid implementation on Cray XC40 machines.
EXCELLERAT is a European Centre of Excellence for engineering applications, established and funded by the European Union within the Horizon 2020 programme. The Centre is in effect an initiative of numerous European high-performance computing centres whose ultimate aim is to support several key engineering industries in Europe in handling complex applications that use High Performance Computing technologies. The project's final objective is to make better use of scientific progress in HPC-driven engineering and to address current economic and societal challenges at the European level in a coherent way. CINECA is one of the proposing centres, will be one of the first three centres in Europe to host a pre-exascale computer, and is involved in two application use cases with interesting prospects. In this article we briefly describe EXCELLERAT and go into detail on the two selected practical applications.
Modern supercomputers allow the simulation of complex phenomena with increased accuracy. Eventually, this requires finer geometric discretizations with larger numbers of mesh elements. In this context, and extrapolating to the Exascale paradigm, meshing operations such as generation, adaptation or partition, become a critical bottleneck within the simulation workflow. In this paper, we focus on mesh partitioning. In particular, we present some improvements carried out on an in-house parallel mesh partitioner based on the Hilbert Space-Filling Curve.
Additionally, taking advantage of its performance, we present the application of the SFC-based partitioning for dynamic load balancing. This method is based on the direct monitoring of the imbalance at runtime and the subsequent re-partitioning of the mesh. The target weights for the optimized partitions are evaluated using a least-squares approximation considering all measurements from previous iterations. In this way, the final partition corresponds to the average performance of the computing devices engaged.
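The idea of partitioning along a space-filling curve can be sketched in a few lines. The following is a minimal serial 2D illustration, not the in-house parallel partitioner described above: the function names `hilbert_index` and `partition_by_sfc`, the integer grid coordinates, and the way target weights are passed in are all assumptions made for the example. Cells are ordered by their Hilbert index, and the ordered list is cut where the cumulative cell weight crosses each part's target share.

```python
def hilbert_index(order, x, y):
    """Map integer cell (x, y) on a 2**order x 2**order grid to its Hilbert-curve index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def partition_by_sfc(cells, weights, targets, order):
    """Assign each cell to a part by cutting the Hilbert ordering at the target weight shares.

    cells   : list of (x, y) integer coordinates
    weights : computational cost of each cell
    targets : relative share of the total weight each part should receive
    """
    ranked = sorted(range(len(cells)), key=lambda i: hilbert_index(order, *cells[i]))
    total = sum(weights)
    target_total = sum(targets)
    # Cumulative weight at which each part ends.
    bounds, acc = [], 0.0
    for t in targets:
        acc += t / target_total
        bounds.append(acc * total)
    part_of = [0] * len(cells)
    part, running = 0, 0.0
    for i in ranked:
        midpoint = running + 0.5 * weights[i]  # assign a cell by its weight midpoint
        while part < len(targets) - 1 and midpoint > bounds[part]:
            part += 1
        part_of[i] = part
        running += weights[i]
    return part_of


cells = [(x, y) for x in range(4) for y in range(4)]
parts = partition_by_sfc(cells, [1.0] * len(cells), [3, 1], order=2)
print(parts.count(0), parts.count(1))
```

In a dynamic load-balancing setting, `targets` would be re-estimated from runtime measurements (the abstract uses a least-squares fit over previous iterations) and the cut positions recomputed, while the curve ordering itself stays fixed.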
We investigate how the accuracy and certainty of the quantities of interest (QoIs) of canonical wall-bounded turbulent flows are sensitive to various numerical parameters and time averaging. The scale-resolving simulations are performed by Nek5000, an open-source high-order spectral-element code. Different uncertainty quantification (UQ) techniques are utilized in the study. Using non-intrusive polynomial chaos expansion, portraits of error in the QoIs are constructed in the parameter space. The uncertain parameters are taken to be the grid spacing in different directions and the filtering parameters. As a complement to the UQ forward problems, global sensitivity analyses are performed with the results being quantified in the form of Sobol indices. Employing Bayesian optimization based on Gaussian Processes, the possibility of finding optimal combinations of parameters for obtaining QoIs with a given target accuracy is studied. To estimate the uncertainty due to time averaging, the use of different techniques such as classical, batch-based and autoregressive methods is discussed and suggestions are given on how to efficiently integrate such techniques in large-scale simulations. Comparisons of the certainty aspects between high-order and low-order codes (OpenFOAM) are given.
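As an illustration of the batch-based approach to time-averaging uncertainty mentioned above, the sketch below compares a classical estimator that treats samples as independent with a non-overlapping batch-means estimator. It is a generic textbook version under stated assumptions, not the implementation used in the study: the function names are hypothetical, and the synthetic AR(1) signal only stands in for a correlated turbulence time series.

```python
import math
import random


def ar1_signal(n, phi, seed=0):
    """Generate a correlated test signal x[t] = phi * x[t-1] + white noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def naive_standard_error(samples):
    """Standard error of the mean under the (wrong) assumption of independent samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return math.sqrt(var / n)


def batch_means(samples, n_batches):
    """Return (mean, standard error) of a time average from non-overlapping batch means."""
    size = len(samples) // n_batches
    if n_batches < 2 or size < 1:
        raise ValueError("need at least two non-empty batches")
    used = samples[: size * n_batches]  # drop the incomplete trailing batch
    means = [sum(used[k * size:(k + 1) * size]) / size for k in range(n_batches)]
    grand = sum(means) / n_batches
    var_of_means = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return grand, math.sqrt(var_of_means / n_batches)


signal = ar1_signal(20000, phi=0.9, seed=1)
mean, se_batch = batch_means(signal, n_batches=20)
print(f"naive SE: {naive_standard_error(signal):.4f}, batch-means SE: {se_batch:.4f}")
```

For a positively correlated signal the naive estimate is too optimistic, which is why the batches must be long compared with the correlation time of the signal.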
We investigate the aerodynamic performance of active flow control of airfoils and wings using synthetic jets with zero net-mass flow. The study is conducted via wall-resolved and wall-modeled large-eddy simulation using two independent CFD solvers: Alya, a finite-element-based solver; and charLES, a finite-volume-based solver. Our approach is first validated on a NACA4412 airfoil, for which numerical and experimental results are already available in the literature. The performance of synthetic jets is evaluated for two flow configurations: an SD7003 airfoil at moderate Reynolds number with a laminar separation bubble, which is representative of Micro Air Vehicles, and the high-lift configuration of the JAXA Standard Model at realistic Reynolds numbers for landing. In both cases, our predictions indicate that, at high angles of attack, the control successfully eliminates the laminar/turbulent recirculations located downstream of the actuator, which increases the aerodynamic performance. Our efforts illustrate the technology readiness of large-eddy simulation in the design of control strategies for real-world external aerodynamic applications.
The use of Field Programmable Gate Arrays (FPGAs) to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. With the recent developments in FPGA programming technology, the ability to port kernels is becoming far more accessible. However, to gain reasonable performance from this technology it is not enough to simply transfer a code onto the FPGA; instead, the algorithm must be rethought and recast in a dataflow style to suit the target architecture. In this paper we describe the porting, via HLS, of one of the most computationally intensive kernels of the Met Office NERC Cloud model (MONC), an atmospheric model used by climate and weather researchers, onto an FPGA. We describe in detail the steps taken to adapt the algorithm to make it suitable for the architecture and the impact this has on kernel performance. Using a PCIe-mounted FPGA with on-board DRAM, we consider the integration of this kernel within a larger infrastructure and explore the performance characteristics of our approach in contrast to Intel CPUs that are popular in modern HPC machines, over problem sizes involving very large grids. The result of this work is an experience report detailing the challenges faced and lessons learnt in porting this complex computational kernel to FPGAs, as well as exploring the role that FPGAs can play and their fundamental limits in accelerating traditional HPC workloads.
The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance the entire algorithm must be rethought and recast in a dataflow style. Reducing the cost of data movement for all computing devices is critically important, and in this paper we explore the most appropriate techniques for FPGAs. We do this by describing the optimisation of an existing FPGA implementation of an atmospheric model’s advection scheme. By taking an FPGA code that was over four times slower than running on the CPU, mainly due to data movement overhead, we describe the profiling and optimisation strategies adopted to significantly reduce the runtime and bring the performance of our FPGA kernels to a much more practical level for real-world use. The result of this work is a set of techniques, steps, and lessons learnt that we have found significantly improve the performance of FPGA-based HPC codes and that others can adopt in their own codes to achieve similar results.
This chapter will present the European approach for establishing Centres of Excellence in High-Performance Computing (HPC) applications, ensuring best synergies between participants in the different European countries. Those Centres are user-centric and thus driven by the needs of the respective community stakeholders. Within this chapter, the focus will lie on the respective activity for the Engineering community. It will describe what the aims and goals of such a Centre of Excellence are, how it is realized and what challenges need to be addressed to establish a long-term impacting activity in Europe.
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The fluid necessary to actuate on the boundary layer is intermittently injected through an orifice and is driven by the motion of a diaphragm located on a sealed cavity below the surface.
High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The present paper focuses on the role of the periodic actuation mechanisms on the boundary layer of an SD7003 airfoil at Here, the Reynolds number is defined in terms of the free-stream velocity U∞ and the airfoil chord C. The actuation is applied near the leading edge of the airfoil and is periodic in time and in the spanwise direction. The actuation successfully eliminates the laminar bubble at , however, it does not produce an increase in the airfoil aerodynamic efficiency. At angles of attack larger than the point of maximum lift, the actuation eliminates the massive flow separation, the flow being attached to the airfoil surface over a significant part of the airfoil chord. As a consequence, the airfoil aerodynamic efficiency increases by 124%, with a reduction in the drag coefficient of about 46%. This kind of technique seems promising for delaying flow separation and its associated losses when the angle of attack increases beyond the maximum lift for the baseline case.
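The two percentages quoted above can be combined in a short back-of-the-envelope check. The sketch assumes that aerodynamic efficiency is defined as the lift-to-drag ratio C_L / C_D, which the abstract does not state explicitly; the function name is hypothetical and the result is only as precise as the rounded figures it starts from.

```python
def implied_lift_ratio(efficiency_gain, drag_reduction):
    """Return C_L(controlled) / C_L(baseline), assuming efficiency = C_L / C_D."""
    efficiency_ratio = 1.0 + efficiency_gain  # a 124% increase is a factor of 2.24
    drag_ratio = 1.0 - drag_reduction         # a 46% reduction is a factor of 0.54
    # efficiency = C_L / C_D, so C_L ratio = efficiency ratio * C_D ratio
    return efficiency_ratio * drag_ratio


ratio = implied_lift_ratio(1.24, 0.46)
print(f"implied lift ratio: {ratio:.2f}")
```

Under that assumption, the quoted figures would correspond to a lift coefficient roughly 21% above the baseline, so the efficiency gain would come mostly from the drag reduction.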
The use of reconfigurable computing, and FPGAs in particular, has strong potential in the field of High Performance Computing (HPC). However, the traditionally high barrier to entry when it comes to programming this technology has, until now, precluded widespread adoption. To popularise reconfigurable computing with communities such as HPC, Xilinx have recently released the first version of Vitis, a platform aimed at making the programming of FPGAs much more a question of software development than of hardware design. However, a key question is how well this technology fulfils that aim: whether the tooling is mature enough that software developers using FPGAs to accelerate their codes is now a more realistic proposition, or whether it simply increases the convenience for existing experts. To examine this question we use the Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing and optimising HPC codes, describing the different steps and potential pitfalls of the technology. The outcome of this exploration is a demonstration that, whilst Vitis is an excellent step forwards and significantly lowers the barrier to entry in developing codes for FPGAs, it is not a silver bullet, and an underlying understanding of dataflow-style algorithmic design and an appreciation of the architecture are still key to obtaining good performance on reconfigurable architectures.
Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines. Key bottlenecks for these methods are sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases, and linear solvers, where efficient overlapping is necessary to amortize the communication and synchronization cost of sparse matrix-vector multiplication and dot products. We present our work on improving the strong scalability limits of message-passing-based general low-order finite element solvers. Using lightweight one-sided communication offered by partitioned global address space (PGAS) languages, we demonstrate that performance-critical, latency-sensitive sparse matrix assembly can scale almost an order of magnitude better. Linear solvers are also addressed via a signaling put algorithm for low-cost point-to-point synchronization, achieving performance similar to that of message-passing-based linear solvers. We introduce a new hybrid MPI+PGAS implementation of the open-source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in Unified Parallel C (UPC). A detailed description of the implementation and the hybrid interface to FEniCS is given, and the feasibility of the approach is demonstrated via a performance study of the hybrid implementation on Cray XC40 machines.
A framework is developed based on different uncertainty quantification (UQ) techniques in order to assess validation and verification (V&V) metrics in computational physics problems, in general, and computational fluid dynamics (CFD), in particular. The metrics include accuracy, sensitivity and robustness of the simulator’s outputs with respect to uncertain inputs and computational parameters. These parameters are divided into two groups: based on the variation of the first group, a computer experiment is designed, the data of which may become uncertain due to the parameters of the second group. To construct a surrogate model based on uncertain data, Gaussian process regression (GPR) with observation-dependent (heteroscedastic) noise structure is used. To estimate the propagated uncertainties in the simulator’s outputs from first and also the combination of first and second groups of parameters, standard and probabilistic polynomial chaos expansions (PCE) are employed, respectively. Global sensitivity analysis based on Sobol decomposition is performed in connection with the computer experiment to rank the parameters based on their influence on the simulator’s output. To illustrate its capabilities, the framework is applied to the scale-resolving simulations of turbulent channel flow using the open-source CFD solver Nek5000. Due to the high-order nature of Nek5000 a thorough assessment of the results’ accuracy and reliability is crucial, as the code is aimed at high-fidelity simulations. The detailed analyses and the resulting conclusions can enhance our insight into the influence of different factors on physics simulations, in particular the simulations of wall-bounded turbulence.
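To make the global sensitivity step concrete, the following sketch estimates first-order Sobol indices with a pick-freeze Monte Carlo estimator on a toy model. It is an illustration under stated assumptions rather than the framework's own code: inputs are taken as independent and uniform on [0, 1], the function name `sobol_first_order` is hypothetical, and the linear test model simply replaces an expensive flow simulation.

```python
import random


def sobol_first_order(model, dim, n_samples, seed=0):
    """Estimate first-order Sobol indices for independent U(0, 1) inputs.

    Uses two independent sample matrices A and B and, for each input i,
    a third matrix equal to A with column i taken from B (pick-freeze).
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n_samples)]
    f_a = [model(row) for row in a]
    f_b = [model(row) for row in b]
    mean = sum(f_a) / n_samples
    variance = sum((v - mean) ** 2 for v in f_a) / (n_samples - 1)
    indices = []
    for i in range(dim):
        acc = 0.0
        for k in range(n_samples):
            mixed = list(a[k])
            mixed[i] = b[k][i]  # row of A with input i replaced by B's value
            # Centering f(B) by the sample mean reduces estimator variance.
            acc += (f_b[k] - mean) * (model(mixed) - f_a[k])
        indices.append(acc / n_samples / variance)
    return indices


# Toy model: y = 2*x0 + x1, for which the exact indices are 0.8 and 0.2.
print(sobol_first_order(lambda x: 2.0 * x[0] + x[1], dim=2, n_samples=50000))
```

Each index requires one additional set of model evaluations, which is why such analyses are usually run on a surrogate rather than on the flow solver itself.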
The A64FX processor from Fujitsu, being designed for computational simulation and machine learning applications, has the potential for unprecedented performance in HPC systems. In this paper, we evaluate the A64FX by benchmarking against a range of production HPC platforms that cover a number of processor technologies. We investigate the performance of complex scientific applications across multiple nodes, as well as single node and mini-kernel benchmarks. This paper finds that the performance of the A64FX processor across our chosen benchmarks often significantly exceeds other platforms, even without specific application optimisations for the processor instruction set or hardware. However, this is not true for all the benchmarks we have undertaken. Furthermore, the specific configuration of applications can have an impact on the runtime and performance experienced.
Hardware technological advances are struggling to match scientific ambition, and a key question is how we can use the transistors that we already have more effectively. This is especially true for HPC, where the tendency is often to throw computation at a problem whereas codes themselves are commonly bound, at least to some extent, by other factors. By redesigning an algorithm and moving from a Von Neumann to a dataflow style, there is potentially more opportunity to address these bottlenecks on reconfigurable architectures than on more general-purpose architectures. In this paper we explore the porting of Nekbone’s AX kernel, a widely popular HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation is an important part of this code, it is also memory bound on CPUs, and a key question is whether one can ameliorate this by leveraging FPGAs. We first explore optimisation strategies for obtaining good performance, with over a 4000 times runtime difference between the first and final version of our kernel on FPGAs. Subsequently, the performance and power efficiency of our approach on an Alveo U280 are compared against a 24-core Xeon Platinum CPU and an NVIDIA V100 GPU, with the FPGA outperforming the CPU by around four times, achieving almost three quarters of the GPU performance, and being significantly more power efficient than both. The result of this work is a comparison and set of techniques that both apply to Nekbone on FPGAs specifically and are also of interest more widely in accelerating HPC codes on reconfigurable architectures.
Following the recent transition in the high performance computing landscape to more heterogeneous architectures, application developers are faced with the challenge of ensuring good performance across a diverse set of platforms. In this paper, we present our work on porting the spectral element code Nek5000 to the recent vector architecture SX-Aurora TSUBASA. Using Nek5000’s mini-app Nekbone, we formulate suitable loop transformations in key kernels, allowing for better vectorization and increasing the baseline performance by a factor of six. Using the new transformations, we demonstrate that the main compute-intensive matrix-vector and matrix-matrix multiplication kernels achieve close to half the peak performance of an SX-Aurora core. Our work also addresses the gather-scatter operations, a key kernel for efficient matrix-free spectral element formulations. We introduce a new implementation of Nek5000’s gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora’s hardware gather-scatter instructions, improving performance by up to 116%. A detailed description of the implementation is given together with a performance study, comparing both single-node performance and strong scalability characteristics, running across multiple SX-Aurora cards.
Bayesian optimization (BO) based on Gaussian process regression (GPR) is applied to different CFD (computational fluid dynamics) problems which can be of practical relevance. The problems are i) shape optimization in a lid-driven cavity to minimize or maximize the energy dissipation, ii) shape optimization of the wall of a channel flow in order to obtain a desired pressure-gradient distribution along the edge of the turbulent boundary layer formed on the other wall, and finally, iii) optimization of the controlling parameters of a spoiler-ice model to attain the aerodynamic characteristics of the airfoil with actual surface ice. The diversity of the optimization problems, the independence of the optimization approach from any adjoint information, the ease of employing different CFD solvers in the optimization loop, and more importantly, the relatively small number of required flow simulations reveal the flexibility, efficiency, and versatility of the BO-GPR approach in CFD applications. It is shown that finding the global optimum for up to 8 design parameters requires fewer than 90 executions of the CFD solvers. Furthermore, it is observed that the number of flow simulations does not significantly increase with the number of design parameters. The associated computational cost of these simulations can be affordable for many optimization cases with practical relevance.
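The BO-GPR loop described above can be sketched in one dimension with the standard library only. This is a minimal illustration under several assumptions that do not come from the paper: a zero-mean Gaussian process on centered data with a fixed RBF kernel (length scale 0.2, unit variance), expected improvement as the acquisition function, a finite candidate grid, and a cheap analytic objective standing in for a CFD run. All function names are hypothetical.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit variance (fixed hyperparameters)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(m):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, rhs):
    """Forward substitution for L y = rhs."""
    y = []
    for i in range(len(rhs)):
        y.append((rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def solve_upper(low, rhs):
    """Back substitution for L^T x = rhs."""
    n = len(rhs)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rhs[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def fit_gp(xs, ys, jitter=1e-6):
    """Factorize the kernel matrix once; the jitter keeps it numerically positive definite."""
    mean_y = sum(ys) / len(ys)
    k = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    low = cholesky(k)
    alpha = solve_upper(low, solve_lower(low, [y - mean_y for y in ys]))
    return list(xs), low, alpha, mean_y


def predict(gp, x_new):
    """Posterior mean and standard deviation at x_new."""
    xs, low, alpha, mean_y = gp
    k_star = [rbf(x_new, a) for a in xs]
    mu = mean_y + sum(ks * al for ks, al in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """Expected improvement for minimization."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def bayes_opt(objective, candidates, initial, n_iter):
    """Minimize objective over the candidate grid; returns (best x, best y, all x tried)."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        pool = [c for c in candidates if c not in xs]
        if not pool:
            break
        gp, best = fit_gp(xs, ys), min(ys)
        x_next = max(pool, key=lambda c: expected_improvement(*predict(gp, c), best))
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = min(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs


grid = [i / 100 for i in range(101)]
print(bayes_opt(lambda x: (x - 0.3) ** 2, grid, [0.0, 0.5, 1.0], n_iter=10)[:2])
```

In practice the kernel hyperparameters would be estimated from the data and the acquisition function optimized continuously; the fixed values and the grid search here only keep the example short.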
In computational physics, mathematical models are numerically solved and, as a result, realizations for the quantities of interest (QoIs) are obtained. Even when adopting the most accurate numerical methods for deterministic mathematical models, the QoIs can still be uncertain to some extent. Uncertainty is defined as the lack of certainty, and it originates from the lack, impropriety or insufficiency of knowledge and information (Ghanem et al., 2017; Smith, 2013). It is important to note that for a QoI, uncertainty is different from error, which is defined as the deviation of a realization from a reference (true) value. In computational models, various sources of uncertainties may exist. These include, but are not limited to, the fidelity of the mathematical model (i.e., the extent to which the model can reflect the truth), the parameters in the models, initial data and boundary conditions, finite sampling time when computing the time-averaged QoIs, the way numerical errors interact and evolve, computer arithmetic, coding bugs, geometrical uncertainties, etc. Various mathematical and statistical techniques gathered under the umbrella of uncertainty quantification (UQ) can be exploited to assess the uncertainty in different models and their QoIs (Ghanem et al., 2017; Smith, 2013).
The present study focuses on applying different metrics to assess accuracy, robustness and sensitivity of scale-resolving simulations of turbulent channel flow, when the numerical parameters are systematically varied. Derived by combining well-established uncertainty quantification techniques and computer experiments, the metrics act as powerful tools for understanding the behavior of flow solvers and exploring the impact of their numerical parameters, as well as for systematically comparing different solvers. A few examples of uncertain behavior of the solvers, i.e. behaviors that are unexpected or not fully explainable with our a-priori knowledge, are provided. Two open-source solvers, Nek5000 and OpenFOAM, are considered, with the focus on grid resolution and filtering in Nek5000, and grid resolution and numerical dissipation in OpenFOAM. Considering all metrics as well as the computational efficiency, Nek5000 is shown to outperform OpenFOAM. The propagated uncertainty (a measure of robustness) in the profiles of channel flow quantities of interest (QoIs), together with the corresponding Sobol sensitivity indices, quantitatively measures the impact and relative contribution of different numerical parameters at different wall distances.
High-fidelity scale-resolving simulations of turbulent flows can be prohibitively expensive, especially at high Reynolds numbers. Therefore, multifidelity models (MFM) can be highly relevant for constructing predictive models for flow quantities of interest (QoIs), uncertainty quantification, and optimization. For numerical simulation of turbulence, there is a hierarchy of methodologies. In addition, each of these methods has calibration parameters which control the predictive accuracy of the resulting outputs. Compatible with these, the hierarchical MFM strategy which allows for simultaneous calibration of the model parameters, as developed by Goh et al. within a Bayesian framework, is considered in the present study. The multifidelity model is applied to two cases related to wall-bounded turbulent flows. The examples are the prediction of friction at different Reynolds numbers in turbulent channel flow, and the prediction of aerodynamic coefficients for a range of angles of attack of a standard airfoil. In both cases, based on a few high-fidelity datasets, the MFM leads to accurate predictions of the QoIs as well as an estimation of uncertainty in the predictions.
Optimising the design of aviation propulsion systems using computational fluid dynamics is essential to increase their efficiency and reduce pollutant as well as noise emissions. Nowadays, and within this optimisation and design phase, it is possible to perform meaningful unsteady computations of the various components of a gas-turbine engine. However, these simulations are often carried out independently of each other and only share averaged quantities at the interfaces, minimising the impact and interactions between components. In contrast to the current state of the art, this work presents a 360 azimuthal degrees large-eddy simulation with over 2100 million cells of the DGEN-380 demonstrator engine enclosing a fully integrated fan, compressor and annular combustion chamber at take-off conditions, as a first step towards a high-fidelity simulation of the full engine. In order to carry out such a challenging simulation and reduce the computational cost, the initial solution is interpolated from stand-alone sectoral simulations of each component. In terms of approach, the integrated mesh is generated in several steps to work around potential machine-dependent memory limitations. It is then observed that the 360-degree computation converges to an operating point with less than 0.5% difference in zero-dimensional values compared to the stand-alone simulations, yielding an overall performance within 1% of the designed thermodynamic cycle. With the presented methodology, convergence and azimuthally decorrelated results are achieved for the integrated simulation after only 6 fan revolutions.
Despite the need for understanding the complex physics of turbulent flows, conducting high-fidelity experiments and scale-resolving numerical simulations can be prohibitively expensive, particularly at high Reynolds numbers, which are most relevant to engineering applications. On the other hand, accurate yet cost-effective models are required to be developed for uncertainty quantification (UQ), prediction and robust optimization for problems involving turbulent flows, where exploration of the space of inputs and design parameters demands a relatively large number of flow realizations. A remedy could be to use multifidelity models (MFM), which aim at predicting accurate quantities of interest (QoIs) and their statistical moments by combining the data obtained from different fidelities. In this regard, the present study reports our recent progress on further developing and exploiting a class of multifidelity models which rely on Gaussian processes. Following Goh et al., at each hierarchical level in the MFM, the Kennedy-O’Hagan model is used, which allows for considering both model inadequacy and aleatoric uncertainties in the process of data fusion. As a main advantage of the present approach, the calibration parameters as well as the hyperparameters appearing in the Gaussian processes are simultaneously estimated within a Bayesian framework using a limited number of realizations (mostly by running low-fidelity simulations). The constructed MFM can then be employed for uncertainty propagation and prediction over the whole input/design parameter space. Another main advantage of the present MFM over other approaches used with regard to turbulent flows is the possibility of incorporating different types of uncertainty in the predictions. The described MFM is applied to the periodic hill, where both attached and separating turbulent boundary layers exist, so the results could be relevant to applications in marine technology. The design parameters include those defining the geometry of the curved surfaces. When estimating the resulting uncertainties in the QoIs, the influence of the calibration parameters, such as modelling and numerical parameters in RANS, is also considered.
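For orientation, one level of a Kennedy-O’Hagan-type model can be written in its commonly cited form as below. The notation is an assumption chosen for this note and may differ from the one used by the authors: $y_{\mathrm{HF}}$ and $y_{\mathrm{LF}}$ denote the high- and low-fidelity outputs, $x$ the design inputs, $\theta$ the calibration parameters, $\rho$ a scaling factor, $\delta$ the model-inadequacy term and $\varepsilon$ the observation noise.

```latex
y_{\mathrm{HF}}(x) = \rho \, y_{\mathrm{LF}}(x, \theta) + \delta(x) + \varepsilon,
\qquad
\delta(\cdot) \sim \mathcal{GP}\bigl(0, k_{\delta}(\cdot, \cdot)\bigr),
\qquad
\varepsilon \sim \mathcal{N}\bigl(0, \sigma_{\varepsilon}^{2}\bigr)
```

In this form, the calibration parameters $\theta$, the scaling $\rho$ and the kernel hyperparameters of $k_{\delta}$ are the quantities that a Bayesian treatment would estimate jointly from the combined low- and high-fidelity data.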