Performance Optimization for Single Virtual Machine

14.1. Right-sizing Compute Nodes for a Single Virtual Machine.#

Selection of the compute nodes depends on the domain size and resolution for the CMAQ case, and what your model run time requirements are. Larger hardware and memory configurations may also be required for instrumented versions of CMAQ incuding CMAQ-ISAM and CMAQ-DDM3D. Running on a single virtual machine requires that the user know how CMAQ scales for the domain of interest.

14.2. An explanation of why a scaling analysis is required for Single Node#

Quote from the following link.

“IMPORTANT: The optimal value of –nodes and –ntasks for a parallel code must be determined empirically by conducting a scaling analysis. As these quantities increase, the parallel efficiency tends to decrease. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. If multiple nodes are used then in most cases one should try to use all of the CPU-cores on each node.”

Note

For the scaling analysis that was performed with CMAQ, the parallel efficiency was determined as the runtime for the smallest number of CPUs divided by the product of the parallel execution time and the number of additional cpus used. If smallest NPCOLxNPROW configuration was 18 cpus, the run time for that case was used, and then the parallel efficiency for the case where 36 cpus were used would be parallel efficiency = runtime_18cpu/(runtime_36cpu*2)*100

Azure HBv3-120 Pricing

Azure HPC HBv3_120pe Pricing

Table 1. Azure Instance On-Demand versus Spot Pricing (price is subject to change)

Instance Name

CPUs

RAM

Memory Bandwidth

Network Bandwidth

Linux On-Demand Price

Linux Spot Price

HBv3-120

120

448 GiB

350 Gbps

200 Gbps(Infiniband)

$3.6/hour

$.36/hour

Table 2. Timing Results for CMAQv5.4+ 2 Day CONUS2 Run on Single Virtual Machine HBv120 (120 cpu per node) I/O on /shared directory (UPDATE)

CPUs

NodesbyCPU

NPCOLxNPROW

Day1 Timing (sec)

Day2 Timing (sec)

TotalTime

CPU Hours/day

SBATCH –exclusive

Data Imported or Copied

Equation using Spot Pricing

SpotCost

Equation using On Demand Pricing

OnDemandCost

compiler flag

i/o dir

16

1x16

4x4

10374.66

9310.67

19685.33

2.734

no

copied

$.36/hr * 1 nodes * 5.468 =

7.87

3.6/hr * 1 nodes * 5.468 =

19.68

with -march=native compiler flag

shared/data

36

1x36

6x6

5102.89

4714.96

9817.85

1.36

no

copied

$.36/hr * 1 nodes * 2.72 =

3.92

3.6/hr * 1 nodes * 2.72 =

9.79

with -march=native compiler flag

/shared/data

72

1x72

8x9

3130.73

2747.3

5878.03

.815

no

copied

$.36/hr * 1 nodes * 1.63 =

2.35

3.6/hr * 1 nodes * 1.63 =

5.87

with -march=native compiler flag

/shared/data

90

1x90

9x10

2739.38

2417.26

5156.64

.715

no

copied

$.36/hr * 1 nodes * 1.43 =

2.06

3.6/hr * 1 nodes * 1.43 =

5.15

with -march=native compiler flag

/shared/data

120

1x120

10x12

2646.52

2374.21

5020.73

.6973

no

copied

$.36/hr * 1 nodes * 1.3946 =

2.01

3.6/hr * 1 nodes * 1.39 =

5.00

with -march=native compiler flag

/shared/data

Total HBv3-120 compute cost of Running Benchmarking Suite using SPOT pricing = $1.4/hr

Total HBv3-120 compute cost of Running Benchmarking Suite using ONDEMAND pricing = $3.6/hr

Savings is ~ 60% for spot versus ondemand pricing for HBv3-120 compute nodes.

Azure Spot and On-Demand Pricing

14.3. Benchmark Scaling Plots using Single Virtual Machine HBv120#

Figure 1. Plot of Time and On Demand Cost versus CPU

Plot of Total Time and On Demand Cost versus CPUs for HBv120 using ggplot