Gauging High Performance Computing needs for industrial applications: a worked example

Foreword

2020: a tipping point for Computer-aided-design and High performance Computing.

Existing monitoring tools on an HPC cluster

What is missing for the engineer? for the software developer?

Retrieving the missing data

What did we learn on this batch of simulations?

Conclusion

Foreword

2020: a tipping point for Computer-aided-design and High performance Computing.

Existing monitoring tools on an HPC cluster

About accounting

HPC clusters usually rely on batch submission software (in the present case, Slurm), which offers a command line interface for users to request compute resources, and performs allocation of these resources according to the rules defined by the sysadmin. Submission software include logs of user activity, with the primary goal of keeping track of and billing CPU-hours according to each user’s consumption. Here, Slurm is used to give an overview of how the consumption of the allocation is distributed between users.

Distribution of the simulation metrics per user. User 5 is invisible on CPU hours because his jobs are small. User 4 used 66% of the total allocation with only 28% of the jobs submitted.

While CPU-hours are directly billed to the user, each run must be started by hand, hence the number of runs is an indication of the associated labor costs. Both of these metrics are therefore useful to assess the total simulation costs.

One should remember that a supercomputer lasts only a handful of years (3 to 7 max) and is expensive. The amortization of the equipment is calculated assuming an average load over 80%. Optimizing production is therefore critical.

The data also reveals user habits, and can inform the sysadmin to tune the job scheduling rules and improve productivity. As an example, it seems users have uneven usage patterns throughout the week.

Distribution of the simulation starts per days of week. There is naturally less activity on the weekend days. This is why most clusters offer usually a *production queue* (8 to 12 hours) on weekdays, and *weekend queue* (48 to 72 hours) on weekends. A lower activity is visible on Fridays, hinting maybe the possibility of a *happy debug Friday* queue if the charge is dropping on a weekly basis.

To get this, the administrator periodically performs a query. For example with the following unix command :

>slurm-report.sh -S 090120 -E now -T -g cfd

The job scheduler is SLURM. The command slurm-report.sh gathers information from September 1st -S 090120 to today -E now for the users group cfd -g cfd. The output looks like this:

       JobID    JobName  Partition     Group    Account  AllocCPUS    Elapsed      User CPUTimeRAW 
------------ ---------- ---------- --------- ---------- ---------- ---------- --------- ---------- 
401637             batc       prod       cfd    default         72   12:00:29     anonymous1    3112488 
401643              lam       prod       cfd    default        540   12:00:22     anonymous1   23339880 
401805         F53NTPID       prod       cfd    default        540   12:00:22     anonymous2   23339880 
401808         F53A1NTP       prod       cfd    default        540   12:00:22     anonymous2   23339880 
401813          OFCo706       prod       cfd    default        540   12:00:22     anonymous2   23339880 
401815          OFCo530       prod       cfd    default        540   03:03:16     anonymous2    5937840 
401837         A_STABLE       prod       cfd    default        540   07:54:25     anonymous6   15371100 
401873             batc       prod       cfd    default         72   00:00:00     anonymous1          0 
401875              lam       prod       cfd    default        540   00:00:00     anonymous1          0 
401916        labs_lolo       prod  stg-elsa    default        540   00:04:16     anonymous3     138240 
401999          NEW_TAK       prod       cfd    default        540   05:22:30     anonymous4   10449000 
402004        Z08_URANS       prod      elsa    default        180   07:47:19     anonymous5    5047020 
(...)

There are now groups and tools focused of extracting the maximum of information from these use database. For example the UCit is developing a framework Analyse-IT based on data-science to explore the database.

The limits of accounting

The database is gathering only the information the job scheduler is aware of. The typical job profile is, at best:

The executable name is is often not exploitable, because it is provided by the user, and can be /awesomepath/new_version.exe.

The exit code must also be handled with care. There is no standard of exit codes for an HPC software, so the only software related exit information is abort (MPI_ABORT for example). Other exit codes are about time-outs, parallel exception, hardware problems, all being external to the run.

The conclusion is therefore simple, there is neither engineering-related data nor software-related data under the watch of the job scheduler.

What is missing for the engineer? for the software developer?

Retrieving the missing data

The natural solution would be to generate a new dedicated database, continuously updated like the accounting one. Unfortunately, many hurdles are waiting ahead : designing a system ; tackling permissions and confidentiality issues ; defining common definitions on performances and crashes ; and probably more unexpected challenges. This natural solution will need several years to be put in motion. We will depict here a more “brute force” approach using what is already at hand. The simulations done by engineers for design purpose are saved on archive disks -hopefully for the company-. A large amount of the relevant setups and log files are still available at the end of the year. Here follows a glimpse of a setup file and a log file for the HPC code AVBP. The typical input file of an AVBP run is stored in the run.params, a keyword-based ASCII file with several blocks depending on the complexity of the run.

$RUN-CONTROL
  solver_type = ns
  diffusion_scheme = FE_2delta
  simulation_end_iteration = 100000
  mixture_name = CH4-AIR-2S-CM2_FLAMMABLE
  reactive_flow  = yes
  combustion_model = TF
  equation_of_state = pg
  LES_model = wale
  (...)
$end_RUN-CONTROL

$OUTPUT-CONTROL
  save_solution = yes
  save_solution.iteration = 10000
  save_solution.name = ./SOLUT/solut
  save_solution.additional = minimum
  (...)
$end_OUTPUT-CONTROL

$INPUT-CONTROL
   (...)
$end_INPUT-CONTROL

$POSTPROC-CONTROL
    (...)
$end_POSTPROC-CONTROL

Apparently, most of the missing information about the configuration are present. However, due to the unavoidable jargon, a translator is needed. For example, few people outside the AVBP users know that pgis the keyword for a perfect gas assumption. The content of an AVBP typical log file is presented hereafter, with many sections cropped for readability.

Number of MPI processes :           20


     __      ______  _____   __      ________ ___
    /\ \    / /  _ \|  __ \  \ \    / /____  / _ \
   /  \ \  / /| |_) | |__) |  \ \  / /    / / | | |
  / /\ \ \/ / |  _ <|  ___/    \ \/ /    / /| | | |
 / ____ \  /  | |_) | |         \  /    / / | |_| |
/_/    \_\/   |____/|_|          \/    /_/ (_)___/


Using branch  : D7_0
Version date  : Mon Jan 5 22:03:30 2015 +0100
Last commit   : 9ae0dd172d8145496e8d62f8bd34f25ae2595956

Computation #1/1

AVBP version : 7.0 beta

(...)

 ----> Building dual graph
    >> generation took 0.118s

 ----> Decomposition library: pmetis
    >> Partitioning took  0.161s

       ___________________________________________________________________________________
       | Boundary patches (no reordering)                                                |
       |_________________________________________________________________________________|
       | Patch number   Patch name                         Boundary condition            |
       | ------------   ----------                         ------------------            |
       | 1              CanyonBottom                       INLET_FILM                    |
       (...)
       | 15             PerioRight                         PERIODIC_AXI                  |
       |_________________________________________________________________________________|

       _______________________________________________________________
       | Info on initial grid                                        |
       |_____________________________________________________________|
       | number of dimensions              : 3                       |
       | number of nodes                   : 34684                   |
       | number of cells                   : 177563                  |
       | number of cell per group          : 100                     |
       | number of boundary nodes          : 9436                    |
       | number of periodic nodes          : 2592                    |
       | number of axi-periodic nodes      : 0                       |
       | Type of axi-periodicity           : 3D                      |
       |_____________________________________________________________|
       | After partitioning                                          |
       |_____________________________________________________________|
       | number of nodes                   : 40538                   |
       | extra nodes due to partitioning   : 5854 [+  16.88‰]        |
       |_____________________________________________________________|

(...)

 ----> Starts the temporal loop.

       Iteration #      Time-step [s]         Total time [s]        Iter/sec [s-1]

                1    0.515455426286E-06    0.515455426286E-06    0.357797003731E+01
             ---> Storing isosurface: isoT
               50    0.517183938876E-06    0.258723148413E-04    0.184026441956E+02
              100    0.510691871555E-06    0.515103318496E-04    0.241920225176E+02
              150    0.517872233906E-06    0.772251848978E-04    0.239538511163E+02
              200    0.523273983650E-06    0.103271506216E-03    0.241928988318E+02
              (...)
            27296    0.547339278850E-06    0.150002454353E-01    0.241129046630E+02

 ----> Solution stored in file : ./SOLUT/solut_00000007_end.h5

             ---> Storing cut: sliceX
             ---> Storing cut: cylinder
             ---> Storing isosurface: isoT

 (...)

 ----> End computation.

 ________________________________________________________________________________________________________

       _____________________________________________________________________________________________
       | 20 MPI tasks             Elapsed real time [s]       [s.cores]      [h.cores]             |
       |___________________________________________________________________________________________|
       | AVBP                   :      1137.27               0.2275E+05     0.6318E+01             |
       | Temporal loop          :      1134.31               0.2269E+05     0.6302E+01             |
       | Per iteration          :       0.0416               0.8311E+00                            |
       | Per iteration and node :   0.1198E-05               0.2396E-04                            |
       | Per iteration and cell :   0.2340E-06               0.4681E-05                            |
       |___________________________________________________________________________________________|


 ----> End of AVBP session


 ***** Maximum memory used : 12922716 B ( 0.123241E+02 MB)

Here again, a lot of useful information for the software developer is present : version info, various performances indicators relatives to this approach, problem size. The only missing information is the reason behind a crash. There is however a way out in this case, because log files are written in chronological order.

If the log file stops before:	the error code is:	Which means:
`Input files read`	100	Ascii input failure
`Initial solution read.`	200	Binary input failure
`Partitioning took.`	300	Partitioning failure
`----> Starts the temporal loop.`	400	Pre-treatment failure
`----> End computation.`	500	Temporal loop failure
`----> End of AVBP session.`	600	Wrap up failure
all flags reached	000	Success

These error codes are code-independent. However, one can increase the granularity of the error codes by adding code-dependent custom codes. For example in AVBP, a thermodynamic equilibrium problem is raised by the error message Error in temperature.F which can return a specific code (e.g. 530) to keep track of this particular outcome.

Gathering the data.

In this brute force approach, a script is run on the filesystem, while complying to the Unix permissions. This data-mining step can take some time. Here is a rough description of the actions we used for this illustration:

First search for the log files , and discard duplicates with a checksum comparison.
Find the corresponding setup file.
Parse the log file with all the regular expression arsenal and compute the error code.
Parse the setup file , again with the regular expression arsenal.
Keep that of the creation date and the login of the creator
File this data into a database.

The final database is then filled with both pre-run information (the setup, version time, and username) and post-run information (error codes, performances, completion).

One should keep in mind two biases already present in the newly created database. First this database will not take into account runs erased during cleanup operations. It cannot be considered exhaustive. Second, all runs are equally represented (with an equal interest). In reality only one out of ten runs, at best, is actually contributing to a motivated engineering conclusion. Some runs are only taken to check to versions of the same code give similar results. The added value of a specific run, compared to another, is subjective.

We however have now a quite interesting database to investigate…

Identifying the content of the harvest

We can invite here machine learning to help us handle this heap of data. Principal Component Analysis (PCA) is a well-known technique to reduce dimensions of data. From a lot of components, we can bring down our runs to a 2, 3, or 4 component descriptions.

The simulations inputs parameters are reduced to two components able to reproduce the variety of the simulations. This cross plot shows the scattering or simulations along this two components. Two major clusters are visible, followed by one, maybe two minor clusters.

Once PCA is done, we do have a simpler representation of the variability on the database, but no clue about what are these clusters. We can gather groups of similar runs thanks to clustering. We see in the following 2-components scatter plot that runs have been automatically separated into three groups that have close parameters and can be described to belong to the same family of runs. This way runs can be sorted without human supervision and bias by model, or dimension or any interesting parameter the model makes arise.

Three families of runs are gathered with a clustering analysis. Groups are sorted in population order: the first group include the biggest number of runs, followed by the second and the third.

Runs can be gathered together, according to their convection scheme, artificial viscosity or LES model, the version of the code used and the mixture name.

The *convection scheme* is a critical precision element. LW is a second order, robust scheme when precision is not needed. TTGC is a third order, low-dissipative and tricky scheme, which is 2.5 time more expensive. The *artificial viscosity* is the strategy used to damp the *scheme* imperfections.

The *LES model* is the way the turbulence is computed at scales smaller than the grid. Like the *scheme*, some options are harder to stabilize (sigma). Note that `DNS` means in this particular case that no model was used. The *mixture name* is the gas composition. AVBP is a reactive solver. `AIR` is your everyday oxygen-nitrogen cocktail, `KERO`, `C3H8` and `CH4` the usual fuels Kerosene and Methane. Other mixtures are for test purpose (`*_qpf_*`) or higher precision fuel kinetics `(fuel)-_(n_specs)_(n_reac)*`.

What did we learn on this batch of simulations?

The actual content of the harvest

This batch of simulation can be divided into three families:

The largest family is about non-reactive (AIR) configurations done with high order schemes (TTGC)
The second family is the reactive runs with classic fuels (KERO, CH4, C3H8) with normal finite elements schemes(LW–FE). Surprisingly for the expert, there is often no artificial viscosity, but a complex LES model (Sigma)
The third family is done with a normal scheme (LW) and no LES model, therefore laminar flow with no need of precision on the convection. This last group is harder to sketch. A decomposition in four group might have helped.

This is giving now a clearer picture on what has been actually simulated.

Version acceptation

HPC users can exhibit a lot of inertia in their version upgrades. Who can blame them: in this continuous bleeding edge optimization, chances are highs that the newer version will bring new bugs. However the software team cannot let them roam free, the older the version is, the harder the support will be. Here we will track the adoption rate of new versions among users.

These are the version of the code used in 2018, 2019, 2020. We see here that the 7.3, released in Sept. 2018, was used a lot more that its sequels, until the 7.6 released in March. 2020. Note that 7.6 even got an “early bird” peak in Dec. 2019 due to a pre-release test campaign

This shows the “popularity” of each versions among users, in decreasing order. Apparently user 2 and 4 never moved to the 7.6 version. User 4 stayed limited to the 7.3 version, while using 66% of the CPU hours.

Quantify the problems

Software support is the elephant in the room : it takes most of the time of software team, but is never mentioned, you probably would jinx it. We were talking about reducing the stress on the support team, maybe we can help them to target the weakest points of the code?

This pie chart highlights the percentage of runs that went through and those which crashed, due to different errors. An error code of ‘0’ indicates a run that went through. Any other error code means it crashed.

This shows that on this database, very few runs actually crashed during the time loop. The main crash causes were found at the setup, either when filling the input file (110) of the binary databases for the boundaries (290).

Here we see that almost 12% of the runs crash due to an error code of ‘110’ which means bad filling of the input file. The CPU hours are not waisted, but the end-user hours are, and this build resentment. Various idea could pop : – maybe the parser is too strict? – or the documentation could be improved? – Could it be that some dialogs are giving the wrong mental model? Here the support team should not give in the temptation to guess. With the new database, one can focus on the runs crashed and get new insights, either by manual browsing, or a new data-science analysis if there are too many.

Some users -an undisclosed but substantial number- run over the time limit, loosing the last stride of unsaved simulation in the process. The reason : downsizing the run could stop it before end of the allocated time, which is seen less efficient.

Actual versus expected performance

If you are an HPC expert, some critical data linked to HPC itself will be of high interest.

We will start will the most basic HPC pledge : my code is faster on a higher number of processors. On a priori benchmarks with controlled test case, the figures are nice and clean. What about an a posteriori survey on the harvest itself?

We compute here the efficiency. This metric is used to compare machine performances with this code and numerical scheme. It is expressed in µs/iteration/degree of freedom. It is the time needed to advance one iteration (a computation step), divided by the number of degrees of freedom (the size of the mesh). Efficiency should not compared between two different codes, since the iteration is a code-specific concept, and the d.o.f. are also difficult to convert.

In one way or another, it must linearly decrease with the number of cores to satisfy the pledge.

Most of the jobs are indeed showing an efficiency decreasing linearly with the number of processes. In this linear zone (1), bigger meshes efficiency is less good, drifting away from the lower bound of the linear zone, but these simulation often use more equations and models. Some CPU consuming tasks show terrible performances (2) and should be investigated. A large simulation is also underperforming (3).

We can see the linear trend with the number of MPI processes. This overall trend is right. Unfortunately, there are terrible outliers : 2 orders of magnitude slower on the same code and same machine cannot not be taken lightly. A special investigation should focus on this.

If we dig further in the code-dependent figures, the cache of AVBP can be optimized with the ncell_group parameter. As this is configuration dependent and features dependent, the process of finding the right tuning has always been heuristic. Here we look at a heat-map of the ncell_group parameter versus MPI processors to go even further and check if the association of both parameters is right.

Survey on what `ncell_group` is tested by the user to improve their simulations on the partitions. The fact that only 5 values are found is the proof very few people take the time to optimize the cache with respect to their simulation

Conclusion

Existing monitoring tools on an HPC cluster

About accounting

The limits of accounting

Engineering-related information.

Software-related information.

Gathering the data.

Identifying the content of the harvest

The actual content of the harvest

Version acceptation

Quantify the problems

Actual versus expected performance