Publications
Publications from EXCELLERAT P2
A significant part of computational fluid dynamics (CFD) simulations is solving the large sparse systems of linear equations that result from implicit time integration of the Reynolds-averaged Navier-Stokes (RANS) equations. The sparse linear system solver Spliss aims to provide a linear solver library that, on the one hand, is tailored to these requirements of CFD applications but, on the other hand, is independent of the particular CFD solver. Spliss allows leveraging a range of available HPC technologies, such as hybrid CPU parallelization and the possibility to offload the computationally intensive linear solver to GPU accelerators, while at the same time hiding this complexity from the CFD solver.
This work highlights the steps taken to establish multi-GPU capabilities for the Spliss solver allowing for efficient and scalable usage of large GPU systems. In addition, this work evaluates performance and scalability on CPU and GPU systems using a representative CODA test case as an example. CODA is the CFD software being developed as part of a collaboration between the French Aerospace Lab ONERA, the German Aerospace Center (DLR), Airbus, and their European research partners. CODA is jointly owned by ONERA, DLR and Airbus. The evaluation examines and compares performance and scalability in a strong scaling approach on Nvidia A100 GPUs and the AMD Rome architecture.
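A strong scaling study of the kind described above keeps the problem size fixed and increases the number of devices. As a minimal sketch, speedup and parallel efficiency can be computed from measured runtimes as follows. The function name and the timings are hypothetical placeholders, not results from the paper.

```python
def strong_scaling(times_by_devices):
    """Return {n: (speedup, efficiency)} relative to the smallest device count.

    times_by_devices maps a device count (CPU nodes or GPUs) to the runtime
    in seconds measured for the same, fixed problem size.
    """
    base_n = min(times_by_devices)
    base_t = times_by_devices[base_n]
    result = {}
    for n, t in sorted(times_by_devices.items()):
        speedup = base_t / t
        # Ideal strong scaling gives speedup == n / base_n, i.e. efficiency 1.
        efficiency = speedup * base_n / n
        result[n] = (speedup, efficiency)
    return result


# Hypothetical runtimes in seconds (assumption, for illustration only).
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
metrics = strong_scaling(timings)
for n, (speedup, efficiency) in metrics.items():
    print(f"{n} devices: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency that drops as devices are added indicates that communication or too little work per device is starting to dominate.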
Two-way momentum-coupled direct numerical simulations of a particle-laden turbulent channel flow are performed to investigate the effect of the particle Stokes number and of the particle-to-fluid density ratio on the turbulence modification. The exact regularised point-particle method is used to model the interphase momentum exchange in the presence of solid boundaries, allowing the exploration of an extensive region of the parameter space. Results show that the particles increase the friction drag in the parameter space region considered, namely the Stokes number St+ ∈ [2,80] and the particle-to-fluid density ratio ρp/ρf ∈ [90,5760] at a fixed mass loading ϕ = 0.4. It is noteworthy that the highest drag occurs for small Stokes number particles. A measurable drag increase occurs for all particle-to-fluid density ratios, the effect being reduced significantly only at the highest value of ρp/ρf. The modified stress budget and turbulent kinetic energy equation provide the rationale behind the observed behaviour. The particles’ extra stress causes an additional momentum flux towards the wall that modifies the structure of the buffer layer and of the viscous sublayer, where the streamwise and wall-normal velocity fluctuations are increased. Indeed, in the viscous sublayer, additional turbulent kinetic energy is produced by the particles’ back-reaction, resulting in a strong augmentation of the spatial energy flux towards the wall, where the energy is ultimately dissipated. This behaviour explains the increase of friction drag in particle-laden wall-bounded flows.
Publications from EXCELLERAT’s first project phase (2018-2022)
In recent years, industrial use of highly scalable computing systems (supercomputers) has expanded from an exclusive group of customers who could afford to build up in-house expertise to a wider group that previously had little or no experience in handling these resources. As a result, high-performance computing centres that provide access to these systems have evolved from “bare-metal” provisioning to offering complete solutions, including application optimisation, testing, and access to knowledge. An important industrial community that already uses High Performance Computing (HPC) in its product cycles, as well as in research and development, is found in the engineering sciences. To continue supporting this economic sector as well as possible, easy access is needed to relevant services and sources of knowledge that can address the respective questions and problems in a targeted way. To provide this access, a working group of European high-performance computing centres was founded three years ago. It laid the groundwork for a research and development project, starting in December, to establish a Centre of Excellence (CoE) for engineering: the EXCELLERAT project.
Many engineering applications require complex frameworks to simulate the intricate and extensive sub-problems involved. However, performance analysis tools can struggle when the complexity of the application frameworks increases. In this paper, we share our efforts and experiences in analyzing the performance of CODA, a CFD solver for aircraft aerodynamics developed by DLR, ONERA, and Airbus, which is part of a larger framework for multi-disciplinary analysis in aircraft design. CODA is one of the key next-generation engineering applications represented in the European Centre of Excellence for Engineering Applications (EXCELLERAT). The solver features innovative algorithms and advanced software technology concepts dedicated to HPC. It is implemented in Python and C++ and uses multi-level parallelization via MPI or GASPI and OpenMP. We present, from an engineering perspective, the state of the art in performance analysis tools, discuss the demands and challenges, and present first results of the performance analysis of a CODA performance test case.
Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work to balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element is already a limiting factor on current post-petascale machines. One of the key bottlenecks for these methods is sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases. We present our work on improving the strong scalability limits of message-passing-based general low-order finite element solvers. Using lightweight one-sided communication, we demonstrate that the scalability of performance-critical, latency-sensitive kernels can be improved by almost an order of magnitude. We introduce a new hybrid MPI/PGAS implementation of the open source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in UPC. A detailed description of the implementation and the hybrid interface to FEniCS is given, and we present a detailed performance study of the hybrid implementation on Cray XC40 machines.
EXCELLERAT is a European Centre of Excellence for engineering applications, established and funded by the European Union within the Horizon 2020 programme. The Centre is in effect an initiative of numerous European high-performance computing centres, with the ultimate aim of supporting several key engineering industries in Europe in handling complex applications that use High Performance Computing technologies. The project’s final objective is to make better use of the scientific progress of HPC-driven engineering and to address current economic and societal challenges at the European level in a coherent manner. CINECA is one of the proposing centres; it will be one of the first three centres in Europe to host a pre-exascale computer, and it is involved in two application use cases with interesting forward-looking characteristics. In this article we briefly describe EXCELLERAT and go into the details of the two selected practical applications.
Modern supercomputers allow the simulation of complex phenomena with increased accuracy. Eventually, this requires finer geometric discretizations with larger numbers of mesh elements. In this context, and extrapolating to the Exascale paradigm, meshing operations such as generation, adaptation or partition, become a critical bottleneck within the simulation workflow. In this paper, we focus on mesh partitioning. In particular, we present some improvements carried out on an in-house parallel mesh partitioner based on the Hilbert Space-Filling Curve.
Additionally, taking advantage of its performance, we present the application of the SFC-based partitioning for dynamic load balancing. This method is based on the direct monitoring of the imbalance at runtime and the subsequent re-partitioning of the mesh. The target weights for the optimized partitions are evaluated using a least-squares approximation considering all measurements from previous iterations. In this way, the final partition corresponds to the average performance of the computing devices engaged.
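The two ideas in this abstract, partitioning along a Hilbert space-filling curve and deriving target weights from runtime measurements, can be sketched on a small 2D grid. This is an illustrative toy under stated assumptions, not the in-house partitioner: the cells are integer grid coordinates, the curve index uses the standard bitwise 2D construction, and the least-squares model (time proportional to load, fitted per device through the origin) is an assumed stand-in for the paper's approximation. All function names are hypothetical.

```python
def hilbert_index(order, x, y):
    """Map integer coordinates (x, y) in [0, 2**order) to a 2D Hilbert curve index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def target_fractions(history):
    """Per-device share of the work from past (load, time) measurements.

    Assumed model: time = cost * load, fitted by least squares through the
    origin over all previous iterations; the share is proportional to 1 / cost.
    """
    speeds = []
    for samples in history:
        num = sum(load * time for load, time in samples)
        den = sum(load * load for load, _ in samples)
        speeds.append(den / num)  # inverse of the fitted cost
    total = sum(speeds)
    return [s / total for s in speeds]


def partition_by_curve(cells, weights, fractions, order):
    """Cut the Hilbert-ordered cells into contiguous chunks matching the fractions."""
    ordered = sorted(range(len(cells)), key=lambda i: hilbert_index(order, *cells[i]))
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for f in fractions:
        acc += f * total
        bounds.append(acc)
    parts = [[] for _ in fractions]
    p, running = 0, 0.0
    for i in ordered:
        # Move to the next partition once the cumulative target is reached.
        while p < len(fractions) - 1 and running >= bounds[p] - 1e-12:
            p += 1
        parts[p].append(i)
        running += weights[i]
    return parts


# Hypothetical measurements: device 0 is twice as fast as device 1.
history = [[(10, 1.0), (20, 2.0)], [(10, 2.0), (20, 4.0)]]
cells = [(x, y) for x in range(4) for y in range(4)]
weights = [1.0] * len(cells)
fractions = target_fractions(history)
parts = partition_by_curve(cells, weights, fractions, order=2)
```

Because each partition is a contiguous segment of the curve, re-partitioning at runtime only requires recomputing the cut positions, which is one reason a curve-based partitioner suits dynamic load balancing.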
We investigate how the accuracy and certainty of the quantities of interest (QoIs) of canonical wall-bounded turbulent flows are sensitive to various numerical parameters and time averaging. The scale-resolving simulations are performed by Nek5000, an open-source high-order spectral-element code. Different uncertainty quantification (UQ) techniques are utilized in the study. Using non-intrusive polynomial chaos expansion, portraits of error in the QoIs are constructed in the parameter space. The uncertain parameters are taken to be the grid spacing in different directions and the filtering parameters. As a complement to the UQ forward problems, global sensitivity analyses are performed with the results being quantified in the form of Sobol indices. Employing Bayesian optimization based on Gaussian Processes, the possibility of finding optimal combinations of parameters for obtaining QoIs with a given target accuracy is studied. To estimate the uncertainty due to time averaging, the use of different techniques such as classical, batch-based and autoregressive methods is discussed and suggestions are given on how to efficiently integrate such techniques in large-scale simulations. Comparisons of the certainty aspects between high-order and low-order codes (OpenFOAM) are given.
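Of the time-averaging techniques mentioned above, the batch-based one is the simplest to illustrate. The sketch below estimates the standard error of a time average with non-overlapping batch means and compares it with the naive estimate that treats samples as independent. The AR(1) signal is a synthetic stand-in for a correlated flow time series, and all names and parameter values are assumptions for illustration.

```python
import numpy as np


def batch_means_se(signal, n_batches):
    """Standard error of the time average from non-overlapping batch means."""
    signal = np.asarray(signal, dtype=float)
    batch_len = signal.size // n_batches
    if n_batches < 2 or batch_len < 1:
        raise ValueError("need at least two non-empty batches")
    # Drop trailing samples that do not fill a whole batch.
    trimmed = signal[: batch_len * n_batches]
    means = trimmed.reshape(n_batches, batch_len).mean(axis=1)
    return means.std(ddof=1) / np.sqrt(n_batches)


def naive_se(signal):
    """Standard error assuming independent samples (ignores autocorrelation)."""
    signal = np.asarray(signal, dtype=float)
    return signal.std(ddof=1) / np.sqrt(signal.size)


def ar1_signal(n, phi, seed=0):
    """Synthetic correlated signal x[i] = phi * x[i-1] + white noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = noise[0]
    for i in range(1, n):
        x[i] = phi * x[i - 1] + noise[i]
    return x


samples = ar1_signal(20000, phi=0.9)
se_batch = batch_means_se(samples, n_batches=40)
se_naive = naive_se(samples)
print(f"batch means: {se_batch:.4f}, naive: {se_naive:.4f}")
```

For positively correlated data the naive estimate is too optimistic; the batch estimate is only reliable when each batch is much longer than the correlation time of the signal.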
We investigate the aerodynamic performance of active flow control of airfoils and wings using synthetic jets with zero net-mass flow. The study is conducted via wall-resolved and wall-modeled large-eddy simulation using two independent CFD solvers: Alya, a finite-element-based solver; and charLES, a finite-volume-based solver. Our approach is first validated in a NACA4412, for which numerical and experimental results are already available in the literature. The performance of synthetic jets is evaluated for two flow configurations: a SD7003 airfoil at moderate Reynolds number with laminar separation bubble, which is representative of Micro Air Vehicles, and the high-lift configuration of the JAXA Standard Model at realistic Reynolds numbers for landing. In both cases, our predictions indicate that, at high angles of attack, the control successfully eliminates the laminar/turbulent recirculations located downstream the actuator, which increases the aerodynamic performance. Our efforts illustrate the technology-readiness of large-eddy simulation in the design of control strategies for real-world external aerodynamic applications.
The use of Field Programmable Gate Arrays (FPGAs) to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. With the recent developments in FPGA programming technology, the ability to port kernels is becoming far more accessible. However, to gain reasonable performance from this technology it is not enough to simply transfer a code onto the FPGA; instead, the algorithm must be rethought and recast in a data-flow style to suit the target architecture. In this paper we describe the porting, via HLS, of one of the most computationally intensive kernels of the Met Office NERC Cloud model (MONC), an atmospheric model used by climate and weather researchers, onto an FPGA. We describe in detail the steps taken to adapt the algorithm to make it suitable for the architecture and the impact this has on kernel performance. Using a PCIe mounted FPGA with on-board DRAM, we consider the integration of this kernel within a larger infrastructure and explore the performance characteristics of our approach in contrast to Intel CPUs that are popular in modern HPC machines, over problem sizes involving very large grids. The result of this work is an experience report detailing the challenges faced and lessons learnt in porting this complex computational kernel to FPGAs, as well as exploring the role that FPGAs can play and their fundamental limits in accelerating traditional HPC workloads.
The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance the entire algorithm must be rethought and recast in a dataflow style. Reducing the cost of data movement for all computing devices is critically important, and in this paper we explore the most appropriate techniques for FPGAs. We do this by describing the optimisation of an existing FPGA implementation of an atmospheric model’s advection scheme. By taking an FPGA code that was over four times slower than running on the CPU, mainly due to data movement overhead, we describe the profiling and optimisation strategies adopted to significantly reduce the runtime and bring the performance of our FPGA kernels to a much more practical level for real-world use. The result of this work is a set of techniques, steps, and lessons learnt that we have found significantly improve the performance of FPGA-based HPC codes and that others can adopt in their own codes to achieve similar results.
This chapter will present the European approach for establishing Centres of Excellence in High-Performance Computing (HPC) applications, ensuring best synergies between participants in the different European countries. Those Centres are user-centric and thus driven by the needs of the respective community stakeholders. Within this chapter, the focus will lie on the respective activity for the Engineering community. It will describe what the aims and goals of such a Centre of Excellence are, how it is realized and what challenges need to be addressed to establish a long-term impacting activity in Europe.
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The fluid necessary to actuate on the boundary layer is intermittently injected through an orifice and is driven by the motion of a diaphragm located on a sealed cavity below the surface.
High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
Synthetic (zero net mass flux) jets are an active flow control technique to manipulate the flow field in wall-bounded and free-shear flows. The present paper focuses on the role of the periodic actuation mechanisms on the boundary layer of a SD7003 airfoil at Re = U∞C/ν = 6×10⁴. Here, the Reynolds number is defined in terms of the free-stream velocity U∞ and the airfoil chord C. The actuation is applied near the leading edge of the airfoil and is periodic in time and in the spanwise direction. The actuation successfully eliminates the laminar bubble at AoA = 4°; however, it does not produce an increase in the airfoil aerodynamic efficiency. At angles of attack larger than the point of maximum lift, the actuation eliminates the massive flow separation, the flow being attached to the airfoil surface over a significant part of the airfoil chord. As a consequence, the airfoil aerodynamic efficiency increases by 124%, with a reduction of the drag coefficient of about 46%. This kind of technique seems promising for delaying flow separation and its associated losses when the angle of attack increases beyond the maximum lift for the baseline case.
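The two percentages quoted above also imply a change in lift, which the abstract does not state. As a rough worked example, assuming that aerodynamic efficiency means the lift-to-drag ratio C_L/C_D and that both figures refer to the same post-stall operating point, the implied lift ratio follows from a single multiplication. The function name is hypothetical.

```python
def lift_ratio(efficiency_gain, drag_reduction):
    """Controlled-to-baseline lift ratio, assuming efficiency = C_L / C_D."""
    efficiency_ratio = 1.0 + efficiency_gain  # E_controlled / E_baseline
    drag_ratio = 1.0 - drag_reduction         # C_D_controlled / C_D_baseline
    # C_L = E * C_D, so the lift ratio is the product of the two ratios.
    return efficiency_ratio * drag_ratio


# Values quoted in the abstract: +124% efficiency, about -46% drag.
ratio = lift_ratio(efficiency_gain=1.24, drag_reduction=0.46)
print(f"implied lift ratio: {ratio:.2f}")
```

Under these assumptions the lift coefficient rises by roughly 21%, so most of the efficiency gain would come from the drag reduction rather than from added lift.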
We investigate the aerodynamic performance of active flow control of airfoils and wings using synthetic jets with zero net-mass flow. The study is conducted via wall-resolved and wall-modeled large-eddy simulation using two independent CFD solvers: Alya, a finite-element-based solver; and charLES, a finite-volume-based solver. Our approach is first validated in a NACA4412, for which numerical and experimental results are already available in the literature. The performance of synthetic jets is evaluated for two flow configurations: a SD7003 airfoil at moderate Reynolds number with laminar separation bubble, which is representative of Micro Air Vehicles, and the high-lift configuration of the JAXA Standard Model at realistic Reynolds numbers for landing. In both cases, our predictions indicate that, at high angles of attack, the control successfully eliminates the laminar/turbulent recirculations located downstream the actuator, which increases the aerodynamic performance. Our efforts illustrate the technology-readiness of large-eddy simulation in the design of control strategies for real-world external aerodynamic applications.
The use of reconfigurable computing, and FPGAs in particular, has strong potential in the field of High Performance Computing (HPC). However the traditionally high barrier to entry when it comes to programming this technology has, until now, precluded widespread adoption. To popularise reconfigurable computing with communities such as HPC, Xilinx have recently released the first version of Vitis, a platform aimed at making the programming of FPGAs much more a question of software development rather than hardware design. However a key question is how well this technology fulfils the aim, and whether the tooling is mature enough such that software developers using FPGAs to accelerate their codes is now a more realistic proposition, or whether it simply increases the convenience for existing experts. To examine this question we use the Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing and optimising HPC codes, describing the different steps and potential pitfalls of the technology. The outcome of this exploration is a demonstration that, whilst Vitis is an excellent step forwards and significantly lowers the barrier to entry in developing codes for FPGAs, it is not a silver bullet and an underlying understanding of dataflow style algorithmic design and appreciation of the architecture is still key to obtaining good performance on reconfigurable architectures.
A framework is developed based on different uncertainty quantification (UQ) techniques in order to assess validation and verification (V&V) metrics in computational physics problems, in general, and computational fluid dynamics (CFD), in particular. The metrics include accuracy, sensitivity and robustness of the simulator’s outputs with respect to uncertain inputs and computational parameters. These parameters are divided into two groups: based on the variation of the first group, a computer experiment is designed, the data of which may become uncertain due to the parameters of the second group. To construct a surrogate model based on uncertain data, Gaussian process regression (GPR) with observation-dependent (heteroscedastic) noise structure is used. To estimate the propagated uncertainties in the simulator’s outputs from first and also the combination of first and second groups of parameters, standard and probabilistic polynomial chaos expansions (PCE) are employed, respectively. Global sensitivity analysis based on Sobol decomposition is performed in connection with the computer experiment to rank the parameters based on their influence on the simulator’s output. To illustrate its capabilities, the framework is applied to the scale-resolving simulations of turbulent channel flow using the open-source CFD solver Nek5000. Due to the high-order nature of Nek5000 a thorough assessment of the results’ accuracy and reliability is crucial, as the code is aimed at high-fidelity simulations. The detailed analyses and the resulting conclusions can enhance our insight into the influence of different factors on physics simulations, in particular the simulations of wall-bounded turbulence.
The A64FX processor from Fujitsu, being designed for computational simulation and machine learning applications, has the potential for unprecedented performance in HPC systems. In this paper, we evaluate the A64FX by benchmarking against a range of production HPC platforms that cover a number of processor technologies. We investigate the performance of complex scientific applications across multiple nodes, as well as single node and mini-kernel benchmarks. This paper finds that the performance of the A64FX processor across our chosen benchmarks often significantly exceeds other platforms, even without specific application optimisations for the processor instruction set or hardware. However, this is not true for all the benchmarks we have undertaken. Furthermore, the specific configuration of applications can have an impact on the runtime and performance experienced.
Hardware technological advances are struggling to match scientific ambition, and a key question is how we can use the transistors that we already have more effectively. This is especially true for HPC, where the tendency is often to throw computation at a problem whereas codes themselves are commonly bound, at least to some extent, by other factors. By redesigning an algorithm and moving from a Von Neumann to a dataflow style, there is potentially more opportunity to address these bottlenecks on reconfigurable architectures than on more general-purpose architectures. In this paper we explore the porting of Nekbone’s AX kernel, a widely popular HPC mini-app, to FPGAs using High Level Synthesis via Vitis. Whilst computation is an important part of this code, it is also memory bound on CPUs, and a key question is whether one can ameliorate this by leveraging FPGAs. We first explore optimisation strategies for obtaining good performance, with over a 4000 times runtime difference between the first and final version of our kernel on FPGAs. Subsequently, performance and power efficiency of our approach on an Alveo U280 are compared against a 24 core Xeon Platinum CPU and NVIDIA V100 GPU, with the FPGA outperforming the CPU by around four times, achieving almost three quarters of the GPU performance, and being significantly more power efficient than both. The result of this work is a comparison and set of techniques that both apply to Nekbone on FPGAs specifically and are also of interest more widely in accelerating HPC codes on reconfigurable architectures.
Following the recent transition in the high performance computing landscape to more heterogeneous architectures, application developers are faced with the challenge of ensuring good performance across a diverse set of platforms. In this paper, we present our work on porting the spectral element code Nek5000 to the recent vector architecture SX-Aurora TSUBASA. Using Nek5000’s mini-app Nekbone, we formulate suitable loop transformations in key kernels, allowing for better vectorization, increasing the baseline performance by a factor of six. Using the new transformations, we demonstrate that the main compute-intensive matrix-vector and matrix-matrix multiplication kernels achieve close to half the peak performance of an SX-Aurora core. Our work also addresses the gather-scatter operations, a key kernel for efficient matrix-free spectral element formulation. We introduce a new implementation of Nek5000’s gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora’s hardware gather-scatter instructions, improving performance by up to 116%. A detailed description of the implementation is given together with a performance study, comparing both single node performance and strong scalability characteristics, running across multiple SX-Aurora cards.
Bayesian optimization (BO) based on Gaussian process regression (GPR) is applied to different CFD (computational fluid dynamics) problems which can be of practical relevance. The problems are i) shape optimization in a lid-driven cavity to minimize or maximize the energy dissipation, ii) shape optimization of the wall of a channel flow in order to obtain a desired pressure-gradient distribution along the edge of the turbulent boundary layer formed on the other wall, and finally, iii) optimization of the controlling parameters of a spoiler-ice model to attain the aerodynamic characteristics of the airfoil with an actual surface ice. The diversity of the optimization problems, independence of the optimization approach from any adjoint information, the ease of employing different CFD solvers in the optimization loop, and more importantly, the relatively small number of the required flow simulations reveal the flexibility, efficiency, and versatility of the BO-GPR approach in CFD applications. It is shown that to ensure finding the global optimum of the design parameters of the size up to 8, less than 90 executions of the CFD solvers are needed. Furthermore, it is observed that the number of flow simulations does not significantly increase with the number of design parameters. The associated computational cost of these simulations can be affordable for many optimization cases with practical relevance.
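The BO-GPR loop described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a cheap one-dimensional toy function stands in for a CFD solver run, the Gaussian process uses a squared-exponential kernel with fixed (not fitted) hyperparameters, and the expected-improvement acquisition is maximised on a fixed grid. All function names and parameter values are hypothetical.

```python
import math

import numpy as np


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit variance and fixed length scale."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf(x_train, x_train) + noise * np.eye(x_train.size)
    Ks = rbf(x_train, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v * v, axis=0), 1e-12, None)
    return mean, np.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement for minimisation."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def bayes_opt(objective, n_init=3, n_iter=12, seed=0):
    """Minimise objective on [0, 1]; returns (best_x, best_y, n_evaluations)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)
    x = rng.uniform(0.0, 1.0, n_init)
    y = np.array([objective(v) for v in x])
    for _ in range(n_iter):
        offset = y.mean()  # centre the data for the zero-mean GP
        mean, std = gp_posterior(x, y - offset, grid)
        ei = expected_improvement(mean + offset, std, y.min())
        x_next = grid[np.argmax(ei)]
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next))  # one "solver run" per iteration
    i = int(np.argmin(y))
    return float(x[i]), float(y[i]), int(x.size)


def toy_objective(x):
    """Cheap stand-in for an expensive flow simulation (assumption)."""
    return (x - 0.3) ** 2


best_x, best_y, n_evals = bayes_opt(toy_objective)
print(f"best x = {best_x:.3f}, objective = {best_y:.5f}, evaluations = {n_evals}")
```

The loop needs only objective values, which mirrors the point made above that the approach is independent of adjoint information and that different solvers can be swapped into the `objective` call.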
The present study focuses on applying different metrics to assess accuracy, robustness and sensitivity of scale-resolving simulations of turbulent channel flow, when the numerical parameters are systematically varied. Derived by combining well-established uncertainty quantification techniques and computer experiments, the metrics act as powerful tools for understanding the behavior of flow solvers and exploring the impact of their numerical parameters as well as systematically comparing different solvers. A few examples of uncertain behavior of the solvers, i.e. behaviors that are unexpected or not fully explainable with our a-priori knowledge, are provided. Two open-source codes, Nek5000 and OpenFOAM, are considered with the focus on grid resolution and filtering in Nek5000, and grid resolution and numerical dissipation in OpenFOAM. Considering all metrics as well as the computational efficiency, Nek5000 is shown to outperform OpenFOAM. The propagated uncertainty (a measure of robustness) in the profiles of channel flow quantities of interest (QoIs), together with the corresponding Sobol sensitivity indices, quantitatively measures the impact and relative contribution of different numerical parameters at different wall-distances.
High-fidelity scale-resolving simulations of turbulent flows can be prohibitively expensive, especially at high Reynolds numbers. Therefore, multifidelity models (MFM) can be highly relevant for constructing predictive models for flow quantities of interest (QoIs), uncertainty quantification, and optimization. For numerical simulation of turbulence, there is a hierarchy of methodologies. On the other hand, there are calibration parameters in each of these methods which control the predictive accuracy of the resulting outputs. Compatible with these, the hierarchical MFM strategy which allows for simultaneous calibration of the model parameters as developed by Goh et al. [7] within a Bayesian framework is considered in the present study. The multifidelity model is applied to two cases related to wall-bounded turbulent flows. The examples are the prediction of friction at different Reynolds numbers in turbulent channel flow, and the prediction of aerodynamic coefficients for a range of angles of attack of a standard airfoil. In both cases, based on a few high-fidelity datasets, the MFM leads to accurate predictions of the QoIs as well as an estimation of uncertainty in the predictions.
Optimising the design of aviation propulsion systems using computational fluid dynamics is essential to increase their efficiency and reduce pollutant as well as noise emissions. Nowadays, and within this optimisation and design phase, it is possible to perform meaningful unsteady computations of the various components of a gas-turbine engine. However, these simulations are often carried out independently of each other and only share averaged quantities at the interfaces minimising the impact and interactions between components. In contrast to the current state-of-the-art, this work presents a 360 azimuthal degrees large-eddy simulation with over 2100 million cells of the DGEN-380 demonstrator engine enclosing a fully integrated fan, compressor and annular combustion chamber at take-off conditions as a first step towards a high-fidelity simulation of the full engine. In order to carry such a challenging simulation and reduce the computational cost, the initial solution is interpolated from stand-alone sectoral simulations of each component. In terms of approach, the integrated mesh is generated in several steps to solve potential machine dependent memory limitations. It is then observed that the 360 degrees computation converges to an operating point with less than 0.5% difference in zero-dimensional values compared to the stand-alone simulations yielding an overall performance within 1% of the designed thermodynamic cycle. With the presented methodology, convergence and azimuthally decorrelated results are achieved for the integrated simulation after only 6 fan revolutions.
Bayesian optimisation based on Gaussian process regression (GPR) is an efficient gradient-free algorithm widely used in various fields of data sciences to find global optima. Based on a recent study by the authors, Bayesian optimisation is shown to be applicable to optimisation problems based on simulations of different fluid flows. Examples range from academic to more industrially-relevant cases. As a main conclusion, the number of flow simulations required in Bayesian optimisation was found not to exponentially grow with the dimensionality of the design parameters (hence, no curse of dimensionality). Here, the Bayesian optimisation method is outlined and its application to the shape optimisation of a two-dimensional lid-driven cavity flow is detailed.
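As a rough illustration of the method outlined in this abstract, the following minimal sketch runs Bayesian optimisation on a one-dimensional function with a Gaussian process surrogate and the expected-improvement acquisition function. It is not the authors' implementation: the function names, the squared-exponential kernel with a fixed length scale, the zero prior mean, the grid search over candidates and the quadratic test objective are all assumptions made for brevity. A practical setup would fit the kernel hyperparameters and replace the objective with a flow simulation.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with an assumed, fixed length scale."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(K):
    """Lower-triangular Cholesky factor; pivots are clamped for robustness."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Backward substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement over the current best value (minimisation)."""
    if std < 1e-12:
        return 0.0
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf


def bayes_opt(f, lo, hi, n_init=3, n_iter=10, n_grid=201):
    """Minimise f on [lo, hi]; returns the best sampled point and its value."""
    xs = [lo + (hi - lo) * i / (n_init - 1) for i in range(n_init)]
    ys = [f(x) for x in xs]
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    for _ in range(n_iter):
        best = min(ys)
        # Pick the candidate with the largest expected improvement.
        x_next = max(grid, key=lambda x: expected_improvement(
            *gp_posterior(xs, ys, x), best))
        xs.append(x_next)
        ys.append(f(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best]


# Hypothetical stand-in objective; each call would be one flow simulation.
best_x, best_y = bayes_opt(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

Each iteration costs one objective evaluation, which is why the approach is attractive when every evaluation is an expensive simulation.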
The flow topology of the wake behind a circular cylinder at the super-critical Reynolds number of Re=7.2×10⁵ is investigated by means of large eddy simulations. In spite of the many research works on circular cylinders, there are no studies concerning the main characteristics and topology of the near wake in the super-critical regime. Thus, the present work attempts to fill the gap in the literature and contribute to the analysis of both the unsteady wake and the turbulent statistics of the flow. It is found that although the wake is symmetric and preserves similar traits to those observed in the sub-critical regime, such as the typical two-lobed configuration in the vortex formation zone, important differences are also observed. Owing to the delayed separation of the flow and the transition to turbulence in the attached boundary layer, Reynolds stresses peak in the detached shear layers close to the separation point. The unsteady mean flow is also investigated, and topological critical points are identified in the vortex formation zone and the near wake. Finally, time-frequency analysis is performed by means of wavelets. The study shows that in addition to the vortex shedding frequency, the inception of instabilities that trigger transition to turbulence occurs intermittently in the attached boundary layer and is registered as a phenomenon of variable intensity in time.
Engineering is an important domain for supercomputing, with the Alya model being a popular code for undertaking such simulations. With ever increasing demand from users to model larger, more complex systems at reduced time to solution, it is important to explore the role that novel hardware technologies, such as FPGAs, can play in accelerating these workloads on future exascale systems. In this paper we explore the porting of Alya’s incompressible flow matrix assembly kernel, which accounts for a large proportion of the model runtime, onto FPGAs. After describing in detail successful strategies for optimisation at the kernel level, we then explore sharing the workload between the FPGA and host CPU, mapping the most appropriate parts of the kernel to each of these technologies, enabling us to more effectively exploit the FPGA. We then compare the performance of our approach on a Xilinx Alveo U280 against a 24-core Xeon Platinum CPU and Nvidia V100 GPU, with the FPGA significantly out-performing the CPU and performing comparably against the GPU, whilst drawing substantially less power. The result of this work is both an experience report describing appropriate dataflow optimisations which we believe can be applied more widely as a case-study across HPC codes, and a performance comparison for this specific workload that demonstrates the potential for FPGAs in accelerating HPC engineering simulations.
In computational physics, mathematical models are numerically solved and as a result, realizations for the quantities of interest (QoIs) are obtained. Even when adopting the most accurate numerical methods for deterministic mathematical models, the QoIs can still be uncertain to some extent. Uncertainty is defined as the lack of certainty and it originates from the lack, impropriety or insufficiency of knowledge and information (Ghanem et al., 2017; Smith, 2013). It is important to note that for a QoI, uncertainty is different from error, which is defined as the deviation of a realization from a reference (true) value. In computational models, various sources of uncertainties may exist. These include, but are not limited to, the fidelity of the mathematical model (i.e., the extent to which the model can reflect the truth), the parameters in the models, initial data and boundary conditions, finite sampling time when computing the time-averaged QoIs, the way numerical errors interact and evolve, computer arithmetic, coding bugs, geometrical uncertainties, etc. Various mathematical and statistical techniques gathered under the umbrella of uncertainty quantification (UQ) can be exploited to assess the uncertainty in different models and their QoIs (Ghanem et al., 2017; Smith, 2013). The UQ techniques not only facilitate systematic evaluation of validation and verification metrics, but also play a vital role in evaluation of the confidence and reliability of the data acquired in computations and experiments. Note that accurate accounting for such confidence intervals is crucial in data-driven engineering designs.
Reconfigurable architectures, such as FPGAs, execute code at the electronics level, avoiding assumptions imposed by the general purpose black-box micro-architectures of CPUs and GPUs. Such tailored execution can result in increased performance and power efficiency, and as the HPC community moves towards exascale an important question is the role these hardware technologies can play in future supercomputers. In this paper we explore the porting of the PW advection kernel, an important code component used in a variety of atmospheric simulations and accounting for around 40% of the runtime of the popular Met Office NERC Cloud model (MONC). Building upon previous work which ported this kernel to an older generation of Xilinx FPGA, we target the latest-generation Xilinx Alveo U280 and Intel Stratix 10 FPGAs. Designing around the abstraction of an Application Specific Dataflow Machine (ASDM), we develop a design which is performance portable between vendors, explore implementation differences between the tool chains, and compare kernel performance between FPGA hardware. This is followed by a more general performance comparison, scaling up the number of kernels on the Xilinx Alveo and Intel Stratix 10, against a 24-core Xeon Platinum Cascade Lake CPU and NVIDIA Tesla V100 GPU. When overlapping the transfer of data to and from the boards with compute, the FPGA solutions considerably outperform the CPU and, whilst falling short of the GPU in terms of performance, demonstrate power usage benefits, with the Alveo being especially power efficient. The result of this work is a comparison and set of design techniques that apply both to this specific atmospheric advection kernel on Xilinx and Intel FPGAs, and that are also of interest more widely when looking to accelerate HPC codes on a variety of reconfigurable architectures.
Conjugate Gradient is a widely used iterative method to solve linear systems Ax=b with matrix A being symmetric and positive definite. Part of its effectiveness relies on finding a suitable preconditioner that accelerates its convergence. Factorized Sparse Approximate Inverse (FSAI) preconditioners are a prominent and easily parallelizable option. An essential element of an FSAI preconditioner is the definition of its sparse pattern, which constrains the approximation of the inverse A⁻¹. This definition is generally based on numerical criteria. In this paper we introduce complementary architecture-aware criteria to increase the numerical effectiveness of the preconditioner without incurring significant performance costs. In particular, we define cache-aware pattern extensions that do not trigger additional cache misses when accessing vector x in the y=Ax Sparse Matrix-Vector (SpMV) kernel. As a result, we obtain very significant reductions in average solution time, ranging between 12.94% and 22.85% on three different architectures (Intel Skylake, POWER9 and A64FX) over a set of 72 test matrices.
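To make the role of the preconditioner concrete, the sketch below implements the standard preconditioned Conjugate Gradient iteration in plain Python. It deliberately swaps in a simple Jacobi (diagonal) preconditioner instead of FSAI, since constructing an FSAI pattern is beyond a short example. The dense matrix storage, the helper names and the small tridiagonal test matrix are illustrative assumptions, not taken from the paper.

```python
def matvec(A, x):
    """Dense product y = A x (a sparse format would be used in practice)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, apply_M, tol=1e-10, max_iter=1000):
    """Preconditioned CG for SPD A; apply_M(r) approximates A^{-1} r."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the guess x = 0
    z = apply_M(r)                   # preconditioned residual
    p = list(z)                      # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k              # converged after k iterations
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, Ap)]
        z = apply_M(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z_i + beta * p_i for z_i, p_i in zip(z, p)]
        rz = rz_new
    return x, max_iter


def jacobi(A):
    """Jacobi preconditioner (stand-in for FSAI): divide by the diagonal."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [r_i / d_i for r_i, d_i in zip(r, diag)]


def build_test_matrix(n):
    """Hypothetical SPD test matrix: tridiagonal, strictly diagonally dominant."""
    return [[(2.5 + i) if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


A = build_test_matrix(20)
b = [1.0] * 20
x, iterations = pcg(A, b, jacobi(A))
```

Only `apply_M` changes when a different preconditioner is used. An FSAI preconditioner would apply two sparse triangular factor products there, which is why its cost is dominated by SpMV-like kernels.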
Adaptive mesh refinement (AMR) in the high-order spectral-element method code Nek5000 is demonstrated and validated with well-resolved large-eddy simulations (LES) of the flow past a wing profile. In the present work, the flow around a NACA 4412 profile at a chord-based Reynolds number Re_c=200,000 is studied at two different angles of attack: 5 and 11 degrees. The mesh is evolved from a very coarse initial mesh by means of volume-weighted spectral error indicators, until a sufficient level of resolution is achieved at the boundary and wake regions. The non-conformal implementation of AMR allows the use of a large domain, avoiding the need for a precursor RANS simulation to obtain the boundary conditions (BCs). This eliminates the effect of the steady Dirichlet BCs on the flow, which becomes a relevant source of error at higher angles of attack (especially near the trailing edge and wake regions). Furthermore, over-refinement in the far field and the associated high-aspect ratio elements are avoided, meaning fewer pressure iterations of the solver and a reduced number of elements, which leads to a considerable computational cost reduction. Mean flow statistics are validated using experimental data obtained for the same profile in the Minimum Turbulence Level (MTL) wind tunnel at KTH, as well as with a previous DNS, showing excellent agreement. This work constitutes an important step in the direction of studying stronger pressure gradients and higher Reynolds number complex flows with the high fidelity that high-order simulations make achievable. Eventually, this database can be used for the development and improvement of turbulence models, in particular wall models.
The Conjugate Gradient (CG) method is an iterative solver targeting linear systems of equations 𝐴𝑥 = 𝑏 where 𝐴 is a symmetric and positive definite matrix. CG convergence properties improve when preconditioning is applied to reduce the condition number of matrix 𝐴. While many different options can be found in the literature, the Factorized Sparse Approximate Inverse (FSAI) preconditioner constitutes a highly parallel option based on approximating 𝐴⁻¹. This paper proposes the Communication-aware Factorized Sparse Approximate Inverse preconditioner (FSAIE-Comm), a method to generate extensions of the FSAI sparse pattern that are not only cache friendly, but also avoid increasing communication costs in distributed memory systems. We also propose a filtering strategy to reduce inter-process imbalance. We evaluate FSAIE-Comm on a heterogeneous set of 39 matrices achieving an average solution time decrease of 17.98%, 26.44% and 16.74% on three different architectures, respectively, Intel Skylake, Fujitsu A64FX and AMD Zen 2 with respect to FSAI. In addition, we consider a set of 8 large matrices running on up to 32,768 CPU cores, and we achieve an average solution time decrease of 12.59%.
The CAE product development process can no longer be imagined without numerical simulations. In particular, Computational Fluid Dynamics (CFD) plays a vital role in the prediction of flows around internal or external geometries that can be found in many products from different industrial sectors. The resolution and capability of numerical models have been greatly improved, whereas the regulations and the requirements on the product behaviour have drastically increased. This trend leads to increasing simulation times and very large development trees for a design development. Together with ever shorter development times, the manual comparison of all these design variants becomes cumbersome and is limited to only a few simulations. To address this problem, we propose a machine learning approach to organize the CFD simulations in a structured way, enabling the interactive exploration and postprocessing of several simulation results. The developed methodology learns the relationship between input and output quantities from a set of simulations with parameter variations. For example, the inputs are different variations of the inflow and viscosity and the output is the velocity or pressure distribution in the domain. In detail, the approach consists of learning a low dimensional parameterization of the flow fields, so in essence the shapes of the function distributions are learnt as functions of the input parameters, providing a completely new way of exploring designs and of postprocessing several design solutions. This low dimensional representation allows the organization of the simulations in a structured way, that is, finding clusters of simulations that behave similarly, which is a major advantage in a further prediction step for new sets of parameters. Our proposed method includes a cluster-based prediction of new designs by radial-basis function surrogate models, which leads to much better forecasting quality in local details compared to baseline proper orthogonal decomposition (POD) approaches. The approach is demonstrated on an OpenFoam HVAC duct use case.
The Proper Orthogonal Decomposition (POD) has been used for several years in the post-processing of highly-resolved Computational Fluid Dynamics (CFD) simulations. While the POD can provide valuable insights into the spatial-temporal behaviour of single transient flows, it can be challenging to evaluate and compare results when applied to multiple simulations. Therefore, we propose a workflow based on data-driven techniques, namely dimensionality reduction and clustering, to extract knowledge from large simulation bundles from transient CFD simulations. We apply this workflow to investigate the flow around two cylinders that contains complex modal structures in the wake region. A special emphasis lies on the formulation of in-situ algorithms to compute the data-driven representations during run-time of the simulation. This can reduce the amount of data input and output and enables simulation monitoring to reduce computational efforts. Finally, a classifier is trained to predict characteristic physical behaviour in the flow based only on the input parameters.
This paper presents a load balancing strategy for reaction rate evaluation and chemistry integration in reacting flow simulations. The large disparity in scales during combustion introduces stiffness in the numerical integration of the PDEs and generates load imbalance during the parallel execution. The strategy is based on the use of the DLB library to redistribute the computing resources at node level, lending additional CPU cores to more heavily loaded MPI processes. This approach does not require explicit data transfer and is activated automatically at runtime. Two chemistry descriptions, detailed and reduced, are evaluated on two different configurations: a laminar counterflow flame and a turbulent swirl-stabilized flame. For single-node calculations, speedups of 2.3x and 7x are obtained for the detailed and reduced chemistry, respectively. Results on multi-node runs also show that DLB improves the performance of the pure-MPI code to a degree similar to the single-node runs. It is shown that DLB can deliver performance improvements in both detailed and reduced chemistry calculations.
A framework is introduced for accurate estimation of time-average uncertainties in various types of turbulence statistics. A thorough set of guidelines is provided to adjust the different hyperparameters for estimating uncertainty in sample mean estimators (SMEs). For high-order turbulence statistics, a novel approach is proposed which avoids any linearization and preserves all relevant temporal and spatial correlations and cross-covariances between SMEs. This approach is able to accurately estimate uncertainties in any arbitrary statistical moment. The usability of the approach is demonstrated by applying it to data from direct numerical simulation (DNS) of the turbulent flow over a periodic hill and through a straight circular pipe.
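To illustrate what estimating the uncertainty of a sample mean estimator involves for a correlated time series, the sketch below uses the classical non-overlapping batch means estimator and compares it with the naive estimate that treats samples as independent. This is a standard textbook technique chosen for brevity, not necessarily one of the estimators used in the framework. The function names, the synthetic AR(1) signal and the batch count (the kind of hyperparameter the guidelines refer to) are illustrative assumptions.

```python
import math
import random


def batch_means_uncertainty(samples, n_batches):
    """Mean of the batched samples and its standard error (batch means).

    Trailing samples that do not fill a whole batch are discarded.
    """
    batch_len = len(samples) // n_batches
    if n_batches < 2 or batch_len < 1:
        raise ValueError("need at least 2 batches with at least 1 sample each")
    means = [sum(samples[i * batch_len:(i + 1) * batch_len]) / batch_len
             for i in range(n_batches)]
    grand_mean = sum(means) / n_batches
    # Sample variance of the batch means; batches are assumed long enough
    # to be approximately uncorrelated with each other.
    var_batch = sum((m - grand_mean) ** 2 for m in means) / (n_batches - 1)
    return grand_mean, math.sqrt(var_batch / n_batches)


def naive_standard_error(samples):
    """Standard error assuming independent samples (ignores correlation)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return math.sqrt(var / n)


def ar1_series(n, phi, seed=0):
    """Synthetic correlated signal x_k = phi * x_{k-1} + Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


signal = ar1_series(20000, phi=0.9)
mean, se_batch = batch_means_uncertainty(signal, n_batches=20)
se_naive = naive_standard_error(signal)
```

For a positively correlated signal such as this one, the naive estimate understates the uncertainty because consecutive samples carry less independent information than their count suggests.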