Performance Optimization for CycleCloud

14.4. Right-sizing Compute Nodes for CycleCloud#

Selection of the compute nodes (virtual machines) depends on the domain size and resolution of the CMAQ case, the CMAQ version, the model run-time requirements, and the disks (Beeond, Lustre, or shared) used for input and output. Larger hardware and memory configurations may also be required for instrumented versions of CMAQ, including CMAQ-ISAM and CMAQ-DDM3D. CycleCloud allows you to run the compute nodes only as long as the job requires, and you can also update the compute node type as needed for your domain.

14.5. Why a Scaling Analysis is Required for Multinode or Parallel MPI Codes#

Quote from the following link:

“IMPORTANT: The optimal value of --nodes and --ntasks for a parallel code must be determined empirically by conducting a scaling analysis. As these quantities increase, the parallel efficiency tends to decrease. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. If multiple nodes are used, then in most cases one should try to use all of the cores on each node.”

Note

For the scaling analysis that was performed with CMAQ, the parallel efficiency was determined as the runtime for the smallest number of cores divided by the product of the parallel execution time and the ratio of core counts. If the smallest NPCOL x NPROW configuration used 18 CPUs, the runtime for that case is the reference, and the parallel efficiency for the 36-CPU case is: parallel efficiency = runtime_18cpu / (runtime_36cpu * 2) * 100
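As a check on this formula, here is a minimal shell sketch (not part of the original workflow) that computes the relative parallel efficiency for the HBv3-120 runs in Table 2 below, using the 96-core case as the reference:

```bash
#!/bin/bash
# Relative parallel efficiency (%) against the smallest (96-core) run:
#   efficiency = t_ref / ( t_n * (cores_n / cores_ref) ) * 100
# Timings are the 2-day total runtimes (sec) from Table 2.
t_ref=7079.60
cores_ref=96
for run in "192 4269.40" "288 3255.30" "384 2830.70"; do
  set -- $run   # $1 = cores, $2 = total runtime (sec)
  awk -v tr="$t_ref" -v cr="$cores_ref" -v c="$1" -v t="$2" \
    'BEGIN { printf "%3d cores: %.1f%% efficient\n", c, tr / (t * c / cr) * 100 }'
done
# Prints 82.9%, 72.5%, and 62.5% -- efficiency drops as cores are added.
```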

14.6. Slurm Compute Node Provisioning#

Azure CycleCloud relies on Slurm to make the job allocation and scaling decisions. Jobs are launched and terminated, and resources are maintained, according to the Slurm instructions in the CMAQ run script. The CycleCloud web interface is used to set the identity of the head node and the compute nodes, and the maximum number of compute nodes that can be submitted to the queue.

The number of compute nodes dispatched by the Slurm scheduler is specified in the run script using #SBATCH --nodes=XX and #SBATCH --ntasks-per-node=YY, where the maximum value of YY (tasks per node) is limited by how many cores are on the compute node.

As an example:

For HB120rs_v3 there are 120 cores/node, so the maximum value of YY is 120, i.e., --ntasks-per-node=120. For many of the runs that were done, we set --ntasks-per-node=96 so that we could compare to the ParallelCluster results, and to avoid oversubscribing the cores.

If running a job with 192 processors, this requires --nodes=XX with XX set to 2 compute nodes, as 96 x 2 = 192.

The NPCOL x NPROW setting must then multiply to at most 192, i.e., 16 x 12 or 12 x 16, to use all of the requested cores on the CycleCloud HPC nodes.

If running a job with 240 processors and --ntasks-per-node=120, this requires --nodes=XX with XX set to 2 compute nodes, as 120 x 2 = 240 (see the sketch below).
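As a concrete illustration, a minimal Slurm batch header for the 192-core case might look like the following. This is a sketch, not the tutorial's actual run script; the NPCOL/NPROW line follows the usual CMAQ run-script convention, and your variable names may differ:

```bash
#!/bin/bash
#SBATCH --nodes=2               # XX: number of HB120rs_v3 compute nodes
#SBATCH --ntasks-per-node=96    # YY: MPI tasks per node (at most 120 cores/node)
#SBATCH --exclusive             # do not share the compute nodes with other jobs

# 2 nodes x 96 tasks/node = 192 MPI ranks, so NPCOL x NPROW must equal 192.
# In the CMAQ run script this is typically set as, e.g.:
#   @ NPCOL = 16 ; @ NPROW = 12
```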

Azure HBv3-120 Pricing

Azure HPC HBv3_120pe Pricing

Table 1. Azure Instance On-Demand versus Spot Pricing (prices subject to change)

| Instance Name | Cores | RAM | Memory Bandwidth | Network Bandwidth | Linux On-Demand Price | Linux Spot Price |
| --- | --- | --- | --- | --- | --- | --- |
| HB120rs_v3 | 120 | 448 GiB | 350 GB/s | 200 Gbps (InfiniBand) | $3.60/hour | $0.36/hour |
| HB176_v4 | 176 | 656 GiB | 780 GB/s | 400 Gbps (InfiniBand) | $7.20/hour | $0.41/hour |

Azure HBv3-series Specifications

Azure HBv4-series Specifications

Note: check to see which processors were used by running the lscpu command on a compute node. Output (logged in March 2023):

```
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7763 64-Core Processor
Stepping:            1
CPU MHz:             3021.872
BogoMIPS:            4890.85
Virtualization:      AMD-V
Hypervisor vendor:   Microsoft
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-3
```

Table 2. Timing Results for CMAQv5.4+ 2-Day 12US1 (CONUS) Run on CycleCloud with a D12v2 scheduler node and HBv3-120 Compute Nodes (120 CPUs per node), I/O on /mnt/beeond

| Cores | Nodes | Nodes x Cores | NPCOL x NPROW | Day 1 Timing (sec) | Day 2 Timing (sec) | Total Time (sec) | Compute Node Hours/day | SBATCH exclusive | Equation using Spot Pricing | Spot Cost | Equation using On-Demand Pricing | On-Demand Cost | Compiler Flag | Input Data | CPU MHz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 96 | 1 | 1x96 | 8x12 | 3278.9 | 3800.7 | 7079.60 | .983 | no | $0.36/hr * 1 node * 1.966 hr | $0.708 | $3.60/hr * 1 node * 1.966 hr | $7.077 | without -march=native | Beeond | 3021.872 |
| 192 | 2 | 2x96 | 16x12 | 2027.8 | 2241.6 | 4269.40 | .593 | no | $0.36/hr * 2 nodes * 1.186 hr | $0.854 | $3.60/hr * 2 nodes * 1.186 hr | $8.54 | without -march=native | Beeond | 3021.872 |
| 288 | 3 | 3x96 | 16x18 | 1562.7 | 1692.6 | 3255.30 | .452 | no | $0.36/hr * 3 nodes * 0.904 hr | $0.976 | $3.60/hr * 3 nodes * 0.904 hr | $9.76 | without -march=native | Beeond | 3021.872 |
| 384 | 4 | 4x96 | 16x24 | 1356.5 | 1474.2 | 2830.70 | .393 | no | $0.36/hr * 4 nodes * 0.786 hr | $1.13 | $3.60/hr * 4 nodes * 0.786 hr | $11.3 | without -march=native | Beeond | 3021.872 |

Total HBv3-120 compute cost of running the benchmarking suite using SPOT pricing at $0.36/hr: $0.708 + $0.854 + $0.976 + $1.13 = $3.67 (sum of the Spot Cost column in Table 2)

Total HBv3-120 compute cost of running the benchmarking suite using ON-DEMAND pricing at $3.60/hr: $7.077 + $8.54 + $9.76 + $11.3 = $36.68 (sum of the On-Demand Cost column in Table 2)

For CentOS or Ubuntu Linux in the East US region.

The savings is ~90% for spot versus on-demand pricing for HBv3-120 compute nodes.

Azure Spot and On-Demand Pricing

Table 3. Timing Results for CMAQv5.4+ 2-Day 12US1 (CONUS) Run on CycleCloud with a D12v2 scheduler node and HBv2-120 Compute Nodes (120 cores per node), I/O on /mnt/beeond

| Cores | Nodes | Nodes x Cores | NPCOL x NPROW | Day 1 Timing (sec) | Day 2 Timing (sec) | Total Time (sec) | Compute Node Hours/day | SBATCH exclusive | Equation using Spot Pricing | Spot Cost | Equation using On-Demand Pricing | On-Demand Cost | Compiler Flag | Input Data | Pin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 96 | 1 | 1x96 | 12x8 | 3400.95 | 3437.91 | 6838.86 | .950 | no | $0.36/hr * 1 node * 1.89 hr | $0.68 | $3.60/hr * 1 node * 1.89 hr | $6.804 | no | Beeond | no |
| 192 | 2 | 2x96 | 16x12 | 1954.62 | 1920.57 | 3875.19 | .538 | no | $0.36/hr * 2 nodes * 1.07 hr | $0.77 | $3.60/hr * 2 nodes * 1.07 hr | $7.704 | no | Beeond | no |

Table 4. Timing Results for CMAQv5.4+ 2-Day 12US1 (CONUS) Run on CycleCloud with a D12v2 scheduler node and HB176_v4 Compute Nodes (176 cores per node), I/O using Beeond

| Cores | Nodes | Nodes x Cores | NPCOL x NPROW | Day 1 Timing (sec) | Day 2 Timing (sec) | Total Time (sec) | Compute Node Hours/day | SBATCH exclusive | Equation using Spot Pricing | Spot Cost | Equation using On-Demand Pricing | On-Demand Cost | Compiler Flag | Input Data | Pin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 160 | 1 | 1x176 | 16x10 | 2062.9 | 2235.3 | 4298.2 | 0.597 | no | $0.41/hr * 1 node * 1.19 hr | $0.4879 | $7.20/hr * 1 node * 1.19 hr | $8.568 | no | Beeond | no |
| 320 | 2 | 2x176 | 16x20 | 1644.4 | 1728.3 | 3372.7 | 0.468 | no | $0.41/hr * 2 nodes * 0.938 hr | $0.769 | $7.20/hr * 2 nodes * 0.938 hr | $13.51 | no | Beeond | no |

14.7. Benchmark Scaling Plots using CycleCloud#

Note: the numbers surrounded by a box on the plots indicate the number of compute nodes (virtual machines).

Figure 1. Scaling per Core for HB120_v3 (120 cores/node) and HB176_v4 (176 cores/node)

Figure 2. Plot of Total Time and On-Demand Cost versus Cores for HB120_v3 and HB176_v4

Figure 3. Plot of On-Demand Cost versus Total Time for HB120_v3 and HB176_v4

Note: CMAQ scales well up to ~288 cores for the CONUS domain. As more cores are added beyond 288, CMAQ becomes less efficient at using all of them (see the parallel efficiency calculation in Section 14.5).

Scheduler node D12v2 compute cost: the scheduler is charged for the entire time that the CycleCloud HPC cluster is running (creation to deletion) = 6 hours * $0.?/hr = $? using spot pricing, or 6 hours * $?/hr = $? using on-demand pricing.

14.7.1. Annual Run Estimates#

Using 288 CPUs on the CycleCloud cluster, it would take about one week to run a full year, using 3 HBv3-120 compute nodes, at a cost of $1782 using on-demand nodes, or $178.2 using interruptible spot nodes. (Note: spot nodes have not been tried yet in this tutorial.)
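The arithmetic behind this estimate can be reproduced from the 2-day benchmark (Table 2, 3-node/288-core row) and the Table 1 prices; the following sketch (a hypothetical helper, not part of the tutorial scripts) prints the numbers used in Table 5:

```bash
#!/bin/bash
# Extrapolate annual cost from the 2-day 12US1 benchmark on 3 x HBv3-120 nodes.
awk 'BEGIN {
  t2day   = 3255.3 / 3600       # 2-day benchmark runtime, in hours
  nodehrs = t2day / 2 * 365     # hours per node for a 365-day simulation
  printf "hours/node       = %.0f\n",  nodehrs
  printf "days to complete = %.1f\n",  nodehrs / 24
  printf "spot cost        = $%.1f\n", nodehrs * 3 * 0.36  # 3 nodes @ $0.36/hr
  printf "on-demand cost   = $%.0f\n", nodehrs * 3 * 3.60  # 3 nodes @ $3.60/hr
}'
# Prints ~165 hours/node, ~6.9 days, ~$178 spot, and ~$1782 on-demand,
# matching the 3-node HB120_v3 row in Table 5.
```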

Table 5. Extrapolated Cost for CMAQv5.4 Annual Simulation based on 2 day 12US1 CONUS benchmark, without pinning

| Virtual Machine | Nodes | Cores | SPOT $/hr | On-Demand $/hr | 2-Day Time (sec) | 2-Day Time (hr) | Annual Cost Equation | Total Node Hours | Annual Cost Spot | Annual Cost On-Demand | Days to Complete Annual Simulation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HB120_v3 | 1 | 96 | $0.36 | $3.60 | 7079.60 | 1.96 | 1.96/2 * 365 = 359 hours/node * 1 node | 359 | $129 | $1292 | 14.9 |
| HB120_v3 | 2 | 192 | $0.36 | $3.60 | 4269.40 | 1.19 | 1.19/2 * 365 = 216 hours/node * 2 nodes | 432 | $155.8 | $1558 | 9 |
| HB120_v3 | 3 | 288 | $0.36 | $3.60 | 3255.3 | .904 | .904/2 * 365 = 165 hours/node * 3 nodes | 495 | $178.2 | $1782 | 6.8 |
| HB120_v3 | 4 | 384 | $0.36 | $3.60 | 2830.70 | .786 | .786/2 * 365 = 143.5 hours/node * 4 nodes | 574 | $206.7 | $2066 | 5.95 |
| HB176_v4 | 1 | 160 | $0.41 | $7.20 | 4298.2 | 1.19 | 1.19/2 * 365 = 217.9 hours/node * 1 node | 218 | $89.3 | $1569 | 9.04 |
| HB176_v4 | 2 | 320 | $0.41 | $7.20 | 3372.7 | 0.94 | .94/2 * 365 = 171.55 hours/node * 2 nodes | 343 | $140.7 | $2470.3 | 7.2 |

Azure SSD Disk Pricing

Table 6. Shared SSD File System Pricing

| Storage Type | Storage Throughput | Max IOPS (w/ bursting) | Pricing (monthly) | Price per Mount per Month (Shared Disk) |
| --- | --- | --- | --- | --- |
| Persistent 1 TB | 200 MB/s/TB | 5,000 (30,000) | $122.88/month | $6.57 |

14.7.2. Lustre File System Pricing#

Table 7. Lustre File System Pricing. Note: there isn't currently a method that starts and stops the Lustre filesystem as part of the CycleCloud start and stop, so there is a danger of leaving the Lustre filesystem running for long periods of time. It is recommended that you use the Beeond filesystem, which gives similar performance at zero additional cost.

| Storage Type | Storage Throughput | Storage Capacity Available (in multiples of) | Cost per GiB/hr | Cost/Month for Minimum Capacity Available |
| --- | --- | --- | --- | --- |
| Standard tier | 125 MB/s | 16,000 GiB | $0.000198 | $2312 |
| Premium tier | 250 MB/s | 8,000 GiB | $0.000287 | $1676 |
| Ultra tier | 500 MB/s | 4,000 GiB | $0.000466 | $1361 |

According to the Azure calculators, the price varies by I/O speed, and different tiers have different minimum storage size requirements:

Calculations:
16,000 GiB (17 TB) x 730 hours x $0.000198 per GiB/hour = $2312/month
8,000 GiB (9 TB) x 730 hours x $0.000287 per GiB/hour = $1676/month
4,000 GiB (4.3 TB) x 730 hours x $0.000466 per GiB/hour = $1361/month

CycleCloud and ParallelCluster Price Comparison of Cost Estimate for Annual Simulation (Filesystem + Compute)#

Table 8. Price Estimate for Annual Simulation (Filesystem + Compute)

| Vendor | Cluster Name | Resource Type | Virtual Machine | Nodes | Cores | Minimum Storage Size (GB) | Storage Hourly Price | SPOT $/hr | On-Demand $/hr | CMAQv5.4 Two-Day Runtime (sec) | CMAQv5.4 Two-Day Runtime (hr) | Annual Cost Equation | Total Time (hr/node) | Annual Cost Spot | Annual Cost On-Demand | Storage Cost NFS | Storage Cost Lustre | Storage Cost Beeond | Days to Complete Annual Simulation | Total Cost for Annual Run (Lustre, Compute Node, Scheduler, NFS Storage) | Total Cost for Annual Run (Beeond, Compute Node, Scheduler, NFS Storage) | Cost Savings of Using Beeond Filesystem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Azure | CycleCloud | Compute | HB120_v3 | 3 | 288 | | | $0.36 | $3.60 | 3255.3 | 0.90425 | .904/2 * 365 = | 165.025 | $178.23 | $1782 | | | | 6.9 | $2,462 | $1847 | $615 |
| Azure | CycleCloud | Login | Standard_D8as_v4 | 1 | 8 | N/A | | | $0.0048 | 6510.6 | 1.8085 | 1.805/2 * 365 = | 330.05125 | N/A | $2 | | | | | | | |
| Azure | CycleCloud | Scheduler | Standard_D4s_v3 | 1 | 4 | N/A | | | $0.19 | 6510.6 | 1.8085 | 1.805/2 * 365 = | 330.05125 | N/A | $63 | | | | | | | |
| Azure | CycleCloud | NFS Storage | Premium SSD LRS | 1 | | | $0.0001100 | | | | | | | | | $0.0363056 | | | | | | |
| Azure | CycleCloud | Lustre Storage | Ultra tier (500 MB/s/TiB) | | | 4000 | $0.000466 | | | | | | | | | | $307.607765 | | | | | |
| Azure | CycleCloud | Beeond | 2 * 960 GB NVMe (block) | | | | | | | | | | | | | | | $0 | | | | |
| AWS | ParallelCluster | Compute | hpc7g.16xlarge | 3 | 192 | N/A | | | $1.68 | 3509.8 | 0.9749444444 | .9749/2 * 365 = | 177.9273611 | N/A | $898 | | | | 7.4 | $1,006 | | $81.9 |
| AWS | ParallelCluster | Scheduler | c7g.large | 1 | 2 | N/A | | | $0.07 | 7019.6 | 1.949888889 | 1.949/2 * 365 = | 355.8547222 | N/A | $25.73 | | | | | | | |
| AWS | ParallelCluster | Shared Storage | EBS: GP3 | 1 | | | $0.00010959 | | | | | | | | | $0.03899812 | | | | | | |
| AWS | ParallelCluster | Lustre | Scratch SSD 200 MB/s/TiB | | | 1200 | $0.00019178 | | | | | | | | | | $40.94749118 | | | | | |

Assumptions for Price Estimate for Annual Simulation (Filesystem + Compute)#

  • Assuming that you have an annual simulation turnaround time requirement of < 8 days (less than 3787 seconds for the 2-day benchmark)

  • Assuming you have the scheduler and filesystems available for 2x the length of the compute node time, to allow for building CMAQ, installing input data, and copying output data to an S3 bucket

  • Note: SPOT pricing is not available for AWS hpc7g.16xlarge

  • Note: SPOT pricing is not recommended for the scheduler node

  • Note: instructions for how to use SPOT pricing on Azure are not yet available

  • Note: we have not replicated using the Beeond filesystem on AWS

  • Note: assuming the Lustre filesystem is used at least as long as the scheduler node

  • Note: the Lustre filesystem is created before Azure CycleCloud and would need manual deletion after the run; the Beeond filesystem is recommended due to the difficulty of provisioning a Lustre filesystem on CycleCloud

  • Assuming that you have the scheduler node running 2x longer than the compute nodes

ParallelCluster performance timings and cost estimates are available from the CMAQ on AWS Tutorial: ParallelCluster Cost Estimate