2.1. System Requirements for a Single Virtual Machine or Cycle Cloud Cluster#

2.1.1. Please set up a alarm on Azure#

Set alarm to receive an email alert if you exceed $100 per month (or what ever monthly spending limit you need). It may be possible to set up daily or weekly spending alarms as well.

2.1.2. Azure Documentation on selecting the right VM for your workloads#

Description of Azure Virtual Machines

For CMAQ, it is recommended that the user select a High Performance Compute Virtual Machine.

Virtual Machine Sizes Available on Azure

2.2. Software Requirements for CMAQ on Single VM or CycleCloud Cluster#

The software requirements to run CMAQ on Azure are split into three tiers. The first tier includes the software that is provided with the operating system, the second tier includes the libraries required by CMAQ, the third tier includes the CMAQ code and associated pre and post processors, and the third tier includes the R software and packages requried by the analysis scripts for verifying output or doing a quality assurance of CMAQ.

Tier 1: Native Operating System (OS) and associated system libraries, compilers for both Single VM or Cycle Cloud Cluster

  • Tcsh shell

  • Alma Linux Gen. 2

  • Git

  • Compilers (C, C++, and Fortran) - GNU compilers version ≥ gcc (GCC) 9.2.0 (need to use module load gcc-9.2.0)

  • MPI (Message Passing Interface) - OpenMPI ≥ 4.1.0 (need to use module load mpi/openmpi-4.1.0)

Tier 1: For the Cycle Cloud Cluster

  • Slurm Scheduler

Tier 2: additional libraries required for installing CMAQ

  • NetCDF (with C, C++, and Fortran support)

  • I/O API

Tier 3: Software distributed thru the CMAS Center

  • CMAQv5.4+

  • CMAQv5.4+ Post Processors

Tier 4: R packages and Scripts

  • R QA Scripts

2.2.1. Hardware Requirements#

2.2.2. Azure Single Virtual Machine#

Azure offers generalized, compute, and high performance machines of various sizes. The amount of memory and the number of cpus required to run CMAQ depends on the domain size and resolution of the case that is being run. For this tutorial that uses a two day run of the CONUS2 domain, a minimum size recommended is a HC44rs (44 cpus) or HBv120 (120 cpus) compute node, to allow CMAQ to be run on up to 44 or 120 cpus.

HC Series Virtual Machine Overview

Physically, an HC-series server is 2 * 24-core Intel Xeon Platinum 8168 CPUs for a total of 48 physical cores. Each CPU is a single pNUMA domain, and has unified access to six channels of DRAM. Intel Xeon Platinum CPUs feature a 4x larger L2 cache than in prior generations (256 KB/core -> 1 MB/core), while also reducing the L3 cache compared to prior Intel CPUs (2.5 MB/core -> 1.375 MB/core).

The above topology carries over to the HC-series hypervisor configuration as well. To provide room for the Azure hypervisor to operate without interfering with the VM, we reserve pCores 0-1 and 24-25 (that is, the first 2 pCores on each socket). We then assign pNUMA domains all remaining cores to the VM. Thus, the VM will see:

(2 vNUMA domains) * (22 cores/vNUMA) = 44 cores per VM

HBv3-series Specification

“An HBv3-series server features 2 * 64-core EPYC 7V73X CPUs for a total of 128 physical “Zen3” cores with AMD 3D V-Cache. Simultaneous Multithreading (SMT) is disabled on HBv3. 448 GB of RAM, and no hyperthreading with 350 GB/sec of memory bandwidth, up to 32 MB of L3 cache per core, up to 7 GB/s of block device SSD performance, and clock frequencies up to 3.675 GHz.” Quote from above link

HBv4-series Specification.

“An HBv4-series server features 2 * 96-core EPYC 9V33X CPUs for a total of 192 physical “Zen4” cores with AMD 3D-V Cache. Simultaneous Multithreading (SMT) is disabled on HBv4. These 192 cores are divided into 24 sections (12 per socket), each section containing 8 processor cores with uniform access to a 96 MB L3 cache.” Quote from above link

2.2.3. Azure CycleCloud Cluster#

Azure CycleCloud Provides the simplest way to manage HPC workloads using any scheduler (like Slurm, Grid Engine, HPC Pack, HTCondor, LSF, PBS Pro, or Symphony).

CycleCloud allows you to:

  • Deploy full clusters and other resources, including scheduler, compute VMs, storage, networking, and cache

  • Orchestrate job, data, and cloud workflows

  • Give admins full control over which users can run jobs, as well as where and at what cost

  • Customize and optimize clusters through advanced policy and governance features, including cost controls, Active Directory integration, monitoring, and reporting

  • Use your current job scheduler and applications without modification

  • Take advantage of built-in autoscaling and battle-tested reference architectures for a wide range of HPC workloads and industries

Azure CycleCloud

12US1 Benchmark Domain Description#

GRIDDESC
'12US1'
'LAM_40N97W'  -2556000.   -1728000.   12000.  12000.  459  299    1

CMAQ Domain

2.3. Storage Options#

CMAQ requires low-latency storage, especially if you are running CMAQ on a large domain and using more than 200 processors.

Azure File Storage account for premium file shares is required.

Quote from following link: “Provisioned file shares can be dynamically scaled up or down depending on your storage and IO performance characteristics. The provisioned size of the file share can be increased at any time but can be decreased only after 24 hours since the last increase. After waiting for 24 hours without a quota increase, you can decrease the share quota as many times as you like, until you increase it again. IOPS/throughput scale changes will be effective within a few minutes after the provisioned size change.”

Lustre Managed Server will be available to the public in March 2023. Files can be stored on blob storage and linked to the Lustre server and re-hydrated when needed as input for CMAQ.

Azure Premium File Shares