This tutorial illustrates how computations can be executed on large (or small) compute servers, aka. high-performance computers (HPC), aka. compute clusters, aka. supercomputers, etc.
This means the computation is not executed on the local workstation (or laptop) but on some other computer. This approach is particularly handy for large computations which run for multiple hours or days, since the user can e.g. shut down or restart the personal computer without killing the compute job.
BoSSS features a set of classes and routines (an API, application programming interface) for communication with compute clusters. This is especially handy for scripting, e.g. for parameter studies, where dozens of computations have to be started and monitored.
First, we initialize the new worksheet. Note: this worksheet is available as MetaJobManager.ipynb; one can load it directly into Jupyter to interactively work with the following code examples. BoSSSpad.dll is required: you must either set #r "BoSSSpad.dll" to something which is appropriate for your computer (e.g. C:\Program Files (x86)\FDY\BoSSS\bin\Release\net5.0\BoSSSpad.dll if you installed the binary distribution), or, if you are working with the source code, you must compile BoSSSpad and put it side-by-side to this worksheet file (from the original location in the repository, you can use the scripts getbossspad.sh, resp. getbossspad.bat).
#r "BoSSSpad.dll"
//#r "../../../src/L4-application/BoSSSpad/bin/Debug/net6.0/BoSSSpad.dll"
using System;
using System.Collections.Generic;
using System.Linq;
using ilPSP;
using ilPSP.Utils;
using BoSSS.Platform;
using BoSSS.Foundation;
using BoSSS.Foundation.Grid;
using BoSSS.Foundation.Grid.Classic;
using BoSSS.Foundation.IO;
using BoSSS.Solution;
using BoSSS.Solution.Control;
using BoSSS.Solution.GridImport;
using BoSSS.Solution.Statistic;
using BoSSS.Solution.Utils;
using BoSSS.Solution.Gnuplot;
using BoSSS.Application.BoSSSpad;
using BoSSS.Application.XNSE_Solver;
using static BoSSS.Application.BoSSSpad.BoSSSshell;
Init();
First, we have to select a batch system (aka. execution queue, aka. queue) that we want to use. Batch systems are a common approach to organize workloads (aka. compute jobs) on compute clusters. On such systems, a user typically does not start a simulation manually/interactively. Instead, the user specifies a so-called compute job. The scheduler (i.e. the batch system) collects compute jobs from all users on the compute cluster, sorts them according to some priority and puts the jobs into some queue, also called batch. The jobs in the batch are then executed in order, depending on the available hardware and the scheduling policies of the system.
The BoSSS API provides front-ends (clients) for the following batch system software:
- BoSSS.Application.BoSSSpad.SlurmClient for the Slurm Workload Manager (very prominent on Linux HPC systems)
- BoSSS.Application.BoSSSpad.MsHPC2012Client for the Microsoft HPC Pack 2012 and higher
- BoSSS.Application.BoSSSpad.MiniBatchProcessorClient for the mini batch processor, a minimalistic, BoSSS-internal batch system which mimics a supercomputer batch system on the local machine.

The list of clients for the various batch systems, which is loaded at the Init() command, can be configured through the ~/.BoSSS/etc/BatchProcessorConfig.json file. If this file is missing, a default setting, containing a mini batch processor, is initialized.
The list of all execution queues can be accessed through:
ExecutionQueues
index | RuntimeLocation | DeploymentBaseDirectory | DeployRuntime | Name | DotnetRuntime | Username | ServerName | ComputeNodes | DefaultJobPriority | SingleNode | AllowedDatabasesPaths |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | win\amd64 | \\fdygitrunner\BoSSStests | True | MSHPC-Gitrunner-HighPrio | dotnet | FDY\jenkinsci | DC2 | [ fdygitrunner ] | Highest | True | [ \\fdygitrunner\BoSSStests, \\fdygitrunner\ValidationTests\databases ] |
1 | win\amd64 | \\fdygitrunner\BoSSStests | False | MSHPC-Gitrunner-DefaultTest | dotnet | FDY\jenkinsci | DC2 | [ hpccluster, hpccluster2, hpcluster3, hpccluster4, fdygitrunner ] | Normal | True | [ \\fdygitrunner\BoSSStests, \\fdygitrunner\ValidationTests\databases ] |
2 | win\amd64 | \\fdygitrunner\ValidationTests\deploy | True | MSHPC-AllNodes | dotnet | FDY\jenkinsci | DC2 | [ hpccluster, hpccluster2, hpcluster3, hpccluster4, fdygitrunner ] | Normal | True | [ \\fdygitrunner\ValidationTests\databases ] |
3 | win\amd64 | \\fdygitrunner\BoSSStests | True | MSHPC-AllNodes-test | dotnet | FDY\jenkinsci | DC2 | [ hpccluster, hpccluster2, hpcluster3, hpccluster4, fdygitrunner ] | Normal | True | [ \\fdygitrunner\BoSSStests ] |
4 | win\amd64 | \\fdygitrunner\ValidationTests\deploy | True | MSHPC-FastNodes | dotnet | FDY\jenkinsci | DC2 | [ hpcluster3, hpccluster4 ] | Normal | True | [ \\fdygitrunner\ValidationTests\databases ] |
In order to run a simulation job, one can either manually select one of these queues -- or one could just use the default queue. The default queue for execution can be configured by two options:
- the DefaultQueueIndex in the configuration file ~/.BoSSS/etc/BatchProcessorConfig.json
- the file ~/.BoSSS/etc/DefaultQueuesProjectOverride.txt, which overrides the default queue for individual projects
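A queue can also be picked explicitly in the worksheet. The following sketch selects a queue by index (as done later in this tutorial) and, as an assumption, by the Name column shown in the table above, treating it as a property of the queue object and assuming ExecutionQueues can be enumerated with LINQ:
var myQueue = ExecutionQueues[1]; // pick a queue by its index
// assumption: the 'Name' column above is exposed as a property of the queue objects
var fastQueue = ExecutionQueues.FirstOrDefault(q => q.Name == "MSHPC-FastNodes");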
The batch processor for local jobs can be started separately (by launching MiniBatchProcessor.exe or dotnet MiniBatchProcessor.dll), which is the preferred option. Alternatively, it can be started from the Jupyter notebook; it then depends on the operating system whether the MiniBatchProcessor.exe is terminated together with the notebook kernel or not. If no mini batch processor is running, it is started (hopefully) upon job activation.
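As a small illustration (a sketch, assuming ExecutionQueues can be enumerated with LINQ), one can check whether such a local mini-batch-processor queue is among the configured clients:
// look for a local MiniBatchProcessorClient among the configured queues
var miniBatchQueue = ExecutionQueues.OfType<MiniBatchProcessorClient>().FirstOrDefault();
Console.WriteLine(miniBatchQueue != null
    ? "A mini batch processor queue is configured."
    : "No mini batch processor queue in this configuration.");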
In order to use the workflow management, the very first thing we have to do is to initialize it by defining a project name, here MetaJobManager_Tutorial. This is used to generate names for the compute jobs and to identify sessions in the database:
BoSSSshell.WorkflowMgm.Init("MetaJobManager_Tutorial");
Project name is set to 'MetaJobManager_Tutorial'. Default Execution queue is chosen for the database. Creating database '\\fdygitrunner\BoSSStests\MetaJobManager_Tutorial'.
For this project, the default execution queue is set to:
GetDefaultQueue()
RuntimeLocation | DeploymentBaseDirectory | DeployRuntime | Name | DotnetRuntime | Username | ServerName | ComputeNodes | DefaultJobPriority | SingleNode | AllowedDatabasesPaths |
---|---|---|---|---|---|---|---|---|---|---|
win\amd64 | \\fdygitrunner\BoSSStests | True | MSHPC-AllNodes-test | dotnet | FDY\jenkinsci | DC2 | [ hpccluster, hpccluster2, hpcluster3, hpccluster4, fdygitrunner ] | Normal | True | [ \\fdygitrunner\BoSSStests ] |
We verify that we have no jobs defined so far ...
BoSSSshell.WorkflowMgm.AllJobs
// the following line is part of the test system and not necessary in user worksheets:
NUnit.Framework.Assert.IsTrue(BoSSSshell.WorkflowMgm.AllJobs.Count == 0, "MetaJobManager tutorial: expecting 0 jobs on entry.");
The initialization of the Workflow Management environment already creates, resp. opens, a BoSSS database with the same name as the project. The current default database is set as:
wmg.DefaultDatabase
{ Session Count = 0; Grid Count = 0; Path = \\fdygitrunner\BoSSStests\MetaJobManager_Tutorial }
// From previous versions of the code, not required anymore:
//var myLocalDb = myBatch.CreateTempDatabase();
Using BatchProcessorClient.CreateTempDatabase(), resp. BatchProcessorClient.CreateOrOpenCompatibleDatabase(...), ensures that the database is in a directory which can be accessed by the batch system. (Alternative functions, i.e. BoSSSshell.CreateTempDatabase() or BoSSSshell.OpenOrCreateDatabase(...), do not guarantee this; the user has to ensure an appropriate location.)
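For example, a batch-compatible scratch database could be created explicitly (a commented sketch only; it is not needed in this tutorial, since WorkflowMgm.Init above already created a suitable default database):
// sketch: create a temporary database on the storage of the default execution queue,
// so that deployed compute jobs can access it
// var tmpDb = GetDefaultQueue().CreateTempDatabase();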
All currently opened databases can be listed using:
databases
#0: { Session Count = 0; Grid Count = 0; Path = \\fdygitrunner\BoSSStests\MetaJobManager_Tutorial }
As an example, we use the workflow management tools to simulate incompressible channel flow; therefore, we have to import the namespace and repeat the steps from the IBM example (Tutorial 2) in order to set up the control object:
using BoSSS.Application.XNSE_Solver;
We create a grid with boundary conditions:
var xNodes = GenericBlas.Linspace(0, 10 , 41);
var yNodes = GenericBlas.Linspace(-1, 1, 9);
GridCommons grid = Grid2D.Cartesian2DGrid(xNodes, yNodes);
grid.DefineEdgeTags(delegate (double[] X) {
double x = X[0];
double y = X[1];
if (Math.Abs(y - (-1)) <= 1.0e-8)
return "wall"; // lower wall
if (Math.Abs(y - (+1)) <= 1.0e-8)
return "wall"; // upper wall
if (Math.Abs(x - (0.0)) <= 1.0e-8)
return "Velocity_Inlet"; // inlet
if (Math.Abs(x - (+10.0)) <= 1.0e-8)
return "Pressure_Outlet"; // outlet
throw new ArgumentOutOfRangeException("unknown domain");
});
Grid Edge Tags changed.
One can save this grid explicitly to a database, but it is not a must; the grid should be saved automatically when the job is activated.
//wmg.DefaultDatabase.SaveGrid(ref grid);
Next, we create the control object for the incompressible simulation:
var c = new XNSE_Control();
// general description:
int k = 1;
string desc = "Steady state, channel, k" + k;
c.SessionName = "SteadyStateChannel";
c.ProjectDescription = desc;
c.savetodb = true;
c.Tags.Add("k" + k);
// setting the grid:
c.SetGrid(grid);
// DG polynomial degree
c.SetDGdegree(k);
// Physical parameters:
double reynolds = 20;
c.PhysicalParameters.rho_A = 1;
c.PhysicalParameters.mu_A = 1.0/reynolds;
// Timestepping properties:
c.TimesteppingMode = AppControl._TimesteppingMode.Steady;
Warning: grid seems not to be saved in a database
The specification of boundary conditions and initial values is a bit more complicated if the job manager is used:
Since the solver is executed in an external program, the control object has to be saved in a file. For lots of complicated objects, especially for delegates, C# does not support serialization (converting the object into a form that can be saved on disk, or transmitted over a network), so a workaround is needed. This is achieved e.g. by the Formula object, where a C#-formula is saved as a string.
var WallVelocity = new Formula("X => 0.0", false); // the 2nd argument 'false' indicates a time-independent formula
Testing the formula:
WallVelocity.Evaluate(new[]{0.0, 0.0}, 0.0) // evaluating at (0,0), at time 0
// [Deprecated]
/// A disadvantage of string-formulas is that they look a bit 'alien'
/// within the worksheet; therefore, there was also a little hack which allowed
/// the conversion of a static member function of a static class into a
/// Formula object:
// Deprecated, this option is no longer supported in .NET5
static class StaticFormulas {
public static double VelX_Inlet(double[] X) {
//double x = X[0];
double y = X[1]; // X[1] is the y-coordinate in 2D
double UX = 1.0 - y*y;
return UX;
}
public static double VelY_Inlet(double[] X) {
return 0.0;
}
}
// InletVelocityX = GetFormulaObject(StaticFormulas.VelX_Inlet);
//var InletVelocityY = GetFormulaObject(StaticFormulas.VelY_Inlet);
var InletVelocityX = new Formula("X => 1.0 - X[1]*X[1]", false); // parabolic inlet profile 1 - y^2
var InletVelocityY = new Formula("X => 0.0", false);
Finally, we set the boundary values for our simulation. The initial values are set to zero by default; for the steady-state simulation, initial values are irrelevant anyway:
c.BoundaryValues.Clear();
c.AddBoundaryValue("wall", "VelocityX", WallVelocity);
c.AddBoundaryValue("Velocity_Inlet", "VelocityX", InletVelocityX);
c.AddBoundaryValue("Velocity_Inlet", "VelocityY", InletVelocityY);
c.AddBoundaryValue("Pressure_Outlet");
Finally, we are ready to deploy the job at the batch processor. In a usual workflow scenario, we do not want to (re-)submit the job every time we run the worksheet -- usually, one wants to run a job only once.
The concept to overcome this problem is job activation. If a job is activated, the meta job manager first checks the databases and the batch system whether a job with the respective name and project name has already been submitted. Only if there is no information that the job was ever submitted or started anywhere, the job is submitted to the respective batch system.
First, a `Job` object is created from the control object:
var JobLocal = c.CreateJob();
This job is not activated yet, it can still be configured:
JobLocal.Status
// Test:
NUnit.Framework.Assert.IsTrue(JobLocal.Status == JobStatus.PreActivation);
One can change e.g. the number of MPI processes:
JobLocal.NumberOfMPIProcs = 1;
Note that these jobs are designed to be persistent: this means the computation is only started once for a given control object, no matter how often the worksheet is executed.
Such a behaviour is useful for expensive simulations which run on HPC servers over days or even weeks. The user (you) can close the worksheet and maybe open and execute it a few days later, and still access the original job which was submitted a few days ago (maybe it is finished by now).
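For illustration, the persistent job status can be queried at any time before deciding whether a (re-)submission is needed; this is just a sketch of a manual check -- Activate() below performs this bookkeeping automatically:
// manual pre-check (sketch): a successful run recorded in the database means
// that a re-submission is not necessary
if(JobLocal.Status != JobStatus.FinishedSuccessful)
    Console.WriteLine("No successful run recorded so far.");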
Then, the job is activated, resp. submitted, resp. deployed to one batch system. If job persistency is not wanted, traces of the job can be removed on request during activation, causing a fresh job deployment at the batch system:
JobLocal.Activate(); // execute the job in the default execution queue
//JobLocal.Activate(ExecutionQueues[4]); // execute the job e.g. in queue 4
Deployments so far (0): ; Success: 0 job submit count: 0
unable to determine job status - unknown
Deploying job SteadyStateChannel ... Creating database '\\fdygitrunner\ValidationTests\databases\MetaJobManager_Tutorial'. Set Database: { Session Count = 0; Grid Count = 0; Path = \\fdygitrunner\BoSSStests\MetaJobManager_Tutorial } Grid is not in database yet... Grid successfully saved: da8a23e2-5699-472f-9e6e-d86e0f988709
Warning: no database is set for the job to submit; nothing may be saved.
Deploying executables and additional files ... Deployment directory: \\fdygitrunner\BoSSStests\MetaJobManager_Tutorial-XNSE_Solver2023Dec05_015458.656431 copied 42 files. written file: control.obj copied 'win\amd64' runtime. deployment finished.
All jobs can be listed using the workflow management:
BoSSSshell.WorkflowMgm.AllJobs
#0: SteadyStateChannel: InProgress (MS HPC client MSHPC-AllNodes-test @DC2, @\\fdygitrunner\BoSSStests) SteadyStateChannel: InProgress (MS HPC client MSHPC-AllNodes-test @DC2, @\\fdygitrunner\BoSSStests)
Check the present job status:
JobLocal.Status
/// BoSSScmdSilent BoSSSexeSilent
NUnit.Framework.Assert.IsTrue(
JobLocal.Status == JobStatus.PendingInExecutionQueue
|| JobLocal.Status == JobStatus.InProgress
|| JobLocal.Status == JobStatus.FinishedSuccessful);
Here, we block until the job has finished:
BoSSSshell.WorkflowMgm.BlockUntilAllJobsTerminate(1000);
All jobs finished.
We examine the output and error stream of the job: this directly accesses the stdout redirection of the respective job manager, which may contain a bit more information than the Stdout copy in the session directory.
JobLocal.Stdout
Session ID: eb90c593-cf20-42c3-8d3d-20329b3029a3, DB path: '\\fdygitrunner\BoSSStests\MetaJobManager_Tutorial' Session directory '\\fdygitrunner\BoSSStests\MetaJobManager_Tutorial\sessions\eb90c593-cf20-42c3-8d3d-20329b3029a3'. Grid repartitioning method: METIS Grid repartitioning options: Number of cell Weights: 0 Going with agglomeration threshold: 0.1 Linearization hint: AdHoc =============== Operator Configuration =============== isGravity :[ ] isVolForce :[ ] isTransport :[ ] isViscous :[x] isPressureGradient :[x] isInterfaceSlip :[ ] isContinuity :[x] isMovingMesh :[ ] isMatInt :[x] isPInterfaceSet :[ ] isImmersedBoundary :[ ] withPressureDissipation :[ ] =============== Linear Solver Configuration =============== Solvercode :Sparse direct solver PARDISO =============== Nonlinear Solver Configuration =============== Solvercode :Newton Convergence Criterion :0 Globalization :Dogleg Minsolver Iterations :4 Maxsolver Iterations :2000 ====================================================== Level-Set field Phi is **exactly** zero: setting entire field to -1. All Cells: min=320 max=320 avg=320 inb=0 tot=320 Cut Cells: min=0 max=0 avg=0 inb=0, tot=0 Starting time step 1, dt = 1.7976931348623158E+304 ... #Line, #Time, #Iter L2Norm MomentumX L2Norm MomentumY L2Norm ContiEq L2Norm Total 1, 1, 2 3.491886E-014 2.034952E-014 5.162442E-015 4.074408E-014 Done with time step 1; solver success: True Removing tag: NotTerminated
Additionally, we display the error stream and hope that it is empty:
JobLocal.Stderr
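A programmatic variant of this check (a sketch, assuming Stderr returns the stream content as a plain string):
// optional sanity check: the error stream should be empty for a clean run
if(!string.IsNullOrWhiteSpace(JobLocal.Stderr))
    Console.WriteLine("Warning: stderr of the job is not empty!");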
We can also obtain the session which was stored during the execution of the job:
var Sloc = JobLocal.LatestSession;
Sloc
MetaJobManager_Tutorial SteadyStateChannel 12/05/2023 01:55:23 eb90c593...
We can also list all attempts to run the job at the assigned batch processor:
JobLocal.AllDeployments
#0: Job token: 838819, FinishedSuccessful 'MetaJobManager_Tutorial-XNSE_Solver2023Dec05_015458.656431' @ MS HPC client MSHPC-AllNodes-test @DC2, @\\fdygitrunner\BoSSStests
NUnit.Framework.Assert.IsTrue(JobLocal.AllDeployments.Count == 1, "MetaJobManager tutorial: Found more than one deployment.");
Finally, we check the status of our jobs:
JobLocal.Status
If anything failed, hints on the reason are provided by the GetStatus method:
JobLocal.GetStatus(WriteHints:true)
Deployments so far (1): (Job token: 838819, FinishedSuccessful 'MetaJobManager_Tutorial-XNSE_Solver2023Dec05_015458.656431' @ MS HPC client MSHPC-AllNodes-test @DC2, @\\fdygitrunner\BoSSStests, FinishedSuccessful); Success: 1 Info: Found successful session "MetaJobManager_Tutorial SteadyStateChannel 12/05/2023 01:55:23 eb90c593..." -- job is marked as successful, no further action.
NUnit.Framework.Assert.IsTrue(JobLocal.Status == JobStatus.FinishedSuccessful, "MetaJobManager tutorial: Job was not successful.");
Each run of the solver corresponds to one session in the database. A session is basically a collection of information on the entire solver run, i.e. the simulation result, input and solver settings, as well as meta-data such as computer, date and time.
Since in this tutorial only one solver run was executed, there is only one session in the workflow management (wmg is just an alias for BoSSSshell.WorkflowMgm):
wmg.Sessions
#0: MetaJobManager_Tutorial SteadyStateChannel 12/05/2023 01:55:23 eb90c593...
We select the first (and only) session and create an export instruction object. The supersampling setting increases the output resolution. This is required to visualize high-order DG polynomials with the low-order Tecplot format: Tecplot can only visualize a linear interpolation within a cell. With a second-degree supersampling, each cell is subdivided twice (in 2D, one subdivision yields 4 cells, i.e. 2 subdivisions yield $4^2 = 16$ cells). In this way, the curve of e.g. a second-order polynomial can be represented by a linear interpolation over 16 cells.
var outPath = wmg.Sessions[0].Export().WithSupersampling(2).Do();
Starting export process... Data will be written to the directory: C:\Users\jenkinsci\AppData\Local\BoSSS\plots\sessions\MetaJobManager_Tutorial__SteadyStateChannel__eb90c593-cf20-42c3-8d3d-20329b3029a3
In the respective directory (see output above) one should finally find plot files which can then be used for further post-processing in third-party software such as Paraview, LLNL VisIt or Tecplot.
The Do() command returns the location of the output files:
outPath
C:\Users\jenkinsci\AppData\Local\BoSSS\plots\sessions\MetaJobManager_Tutorial__SteadyStateChannel__eb90c593-cf20-42c3-8d3d-20329b3029a3
To finalize this tutorial, we list all files in the plot output directory:
System.Threading.Thread.Sleep(10000); // just wait for the external plot application to finish
System.IO.Directory.GetFiles(outPath, "*").Select(fullPath => System.IO.Path.GetFileName(fullPath))
index | value |
---|---|
0 | plotConfig.xml |
1 | state_0.0.2.plt |
2 | state_0.0.plt |
3 | state_0.plt |
4 | state_1.plt |