CMAQ on Azure Tutorial#
Warning
This documentation is under continuous development. The previous version is available here: CMAQv5.3.3 on Azure Tutorial
Community Multiscale Air Quality Model#
The Community Multiscale Air Quality (CMAQ) modeling system is an active open-source development project of the U.S. EPA. The CMAQ system is a Linux-based suite of models that requires significant computational resources and specific system configurations to run. CMAQ combines current knowledge in atmospheric science and air quality modeling, multi-processor computing techniques, and an open-source framework to deliver fast, technically sound estimates of ozone, particulates, toxics, and acid deposition.
For additional background on CMAQ please visit the U.S. EPA CMAQ Website.
CMAQ is a community modeling effort that is supported by the Community Modeling and Analysis System (CMAS) Center at the University of North Carolina at Chapel Hill.
Tutorial Overview#
This document provides tutorials and information on using the Microsoft Azure Online Portal to create either a single Virtual Machine or a CycleCloud cluster to run CMAQ. The tutorials are aimed at users with cloud computing experience who are already familiar with Azure. For those with no cloud computing experience, we recommend reviewing the Additional Resources listed in chapter 15 of this document.
This document provides three hands-on tutorials that are designed to be read in order. The Introductory Tutorial walks you through setting up an Azure Account and logging into the Azure Portal Website. You will learn how to set up your Azure Resource ID, configure and create a demo virtual machine, and then exit and delete the virtual machine and all of its associated resources by deleting the resource group. The Intermediate Tutorial steps you through running a CMAQ test case on a single Virtual Machine, with instructions to install CMAQ, libraries, and input data. The Advanced Tutorial explains how to create a CycleCloud (High Performance Cluster) for larger compute jobs and install CMAQ, required libraries, and input data. The remaining sections provide instructions on post-processing CMAQ output, comparing output and runtimes from multiple simulations, and copying output from CycleCloud to an Amazon Web Services (AWS) Simple Storage Service (S3) bucket.
GMD Paper#
Efstathiou, C. I., Adams, E., Coats, C. J., Zelt, R., Reed, M., McGee, J., Foley, K. M., Sidi, F. I., Wong, D. C., Fine, S., and Arunachalam, S.: Enabling high-performance cloud computing for the Community Multiscale Air Quality Model (CMAQ) version 5.3.3: performance evaluation and benefits for the user community, Geosci. Model Dev., 17, 7001–7027, https://doi.org/10.5194/gmd-17-7001-2024, 2024. PDF: https://gmd.copernicus.org/articles/17/7001/2024/gmd-17-7001-2024.pdf
Azure Subscriptions#
The ability to use resources available in the Microsoft Azure Cloud is limited by quotas that are set at the subscription level. This tutorial was developed using UNC Chapel Hill’s Enterprise account. Additional effort is being made to identify how to use a pay-as-you-go account, but these instructions have not been finalized. There may also be differences in how managed identities and user-level permissions are set by the administrator of your enterprise-level account that are not covered in this tutorial.
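Because quotas are enforced per subscription, it can help to check the arithmetic before requesting a cluster. The sketch below assumes that the relevant quota is counted in vCPUs and that each HB120rs_v3 node consumes 120 vCPUs. The function names and the example quota value are illustrative placeholders, not values taken from this tutorial.

```python
# Minimal sketch: does a requested cluster fit within a vCPU quota?
# Assumptions (not from the tutorial): the quota is counted in vCPUs, and
# the quota value used below is a placeholder to replace with the number
# reported for your own subscription.

VCPUS_PER_HB120RS_V3 = 120  # vCPUs consumed by one HB120rs_v3 node


def max_nodes_within_quota(vcpu_quota: int,
                           vcpus_per_node: int = VCPUS_PER_HB120RS_V3) -> int:
    """Return how many whole nodes fit within the given vCPU quota."""
    if vcpu_quota < 0 or vcpus_per_node <= 0:
        raise ValueError("vcpu_quota must be >= 0 and vcpus_per_node must be > 0")
    return vcpu_quota // vcpus_per_node


def fits_quota(nodes: int, vcpu_quota: int,
               vcpus_per_node: int = VCPUS_PER_HB120RS_V3) -> bool:
    """Return True if the requested node count stays within the quota."""
    if nodes < 0:
        raise ValueError("nodes must be >= 0")
    return nodes * vcpus_per_node <= vcpu_quota


if __name__ == "__main__":
    example_quota = 480  # hypothetical quota, in vCPUs
    print("Nodes that fit:", max_nodes_within_quota(example_quota))
    print("3 nodes fit:", fits_quota(3, example_quota))
```

If the requested node count does not fit, the usual remedy is to request a quota increase for the subscription before creating the cluster.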
Why might I need to use Azure Virtual Machine or CycleCloud?#
An Azure Virtual Machine may be configured to run code compiled with Message Passing Interface (MPI) on a single high performance compute node. The intermediate tutorial demonstrates how to run CMAQ interactively with OpenMPI on multiple CPUs of a single virtual machine.
Azure CycleCloud may be configured to be the equivalent of a High Performance Computing (HPC) environment, including using job schedulers such as Slurm, running on multiple nodes/virtual machines using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the Slurm scheduler is that the number of compute nodes that will be provisioned can be adjusted to meet the requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand instances for the compute nodes. CycleCloud also supports submitting multiple jobs to the job submission queue.
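To make the cost trade-off concrete, the sketch below estimates the cost of a run as node count times runtime times hourly price, and compares Spot against On-Demand pricing. Every price and runtime in it is a hypothetical placeholder, and the function names are illustrative; actual prices vary by region and over time, and Spot nodes can be evicted, which this simple estimate ignores.

```python
# Minimal sketch: estimate run cost for a given node count and pricing model.
# All prices and runtimes below are hypothetical placeholders, not measured
# values or published Azure prices.


def job_cost(nodes: int, runtime_hours: float, price_per_node_hour: float) -> float:
    """Cost of a job that holds `nodes` compute nodes for `runtime_hours`."""
    if nodes <= 0 or runtime_hours < 0 or price_per_node_hour < 0:
        raise ValueError("nodes must be > 0; runtime and price must be >= 0")
    return nodes * runtime_hours * price_per_node_hour


def spot_savings_fraction(on_demand_price: float, spot_price: float) -> float:
    """Fraction of the On-Demand cost saved by paying the Spot price."""
    if on_demand_price <= 0 or spot_price < 0:
        raise ValueError("on_demand_price must be > 0 and spot_price >= 0")
    return 1.0 - spot_price / on_demand_price


if __name__ == "__main__":
    on_demand = 4.00  # hypothetical price per node-hour
    spot = 1.00       # hypothetical price per node-hour
    # Hypothetical runtimes: adding nodes shortens the run, but not linearly.
    for nodes, hours in [(1, 3.0), (2, 1.8), (3, 1.4)]:
        print(nodes, "node(s):",
              "on-demand", round(job_cost(nodes, hours, on_demand), 2),
              "spot", round(job_cost(nodes, hours, spot), 2))
    print("Spot savings fraction:", spot_savings_fraction(on_demand, spot))
```

Because runtime rarely halves when the node count doubles, the cheapest configuration is often not the fastest one, which is why a scaling analysis is useful before choosing a node count.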
Our goal is to make this user guide to running CMAQ on either a single Virtual Machine or a CycleCloud cluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
Additional information on Azure CycleCloud:
Contents:
- 1. Introductory Tutorial
- 2. System Requirements
- 3. Create Single VM using HB120rs_v3 Tutorial
- 3.1. Create a HB120rs_v3 Virtual Machine
- 3.2. Login to the Virtual Machine
- 3.3. Mount the disk on the server as /shared using the instructions on the following link:
- 3.4. Alternatively, you can create an NVMe striped disk that has faster performance.
- 3.5. Download the Input data from the S3 Bucket
- 3.6. Change shell to use tcsh
- 3.7. Create Environment Module for Libraries
- 3.8. Install and Build CMAQ
- 3.9. Copy the run scripts from the repo to the run directory
- 3.10. Run CMAQ interactively using the following command:
- 3.11. Created another single VM using HBv120_v2 and ran again
- 3.12. Created another VM using the HB120v3 cpus
- 3.13. Verify that the correct number of cpus are installed using lscpu
- 3.14. Timing information
- 3.15. Review performance metrics in the Azure portal
- 3.16. If your performance is much slower than this, then we recommend that you terminate the resource group and re-build the VM
- 4. Create CycleCloud HB120rs_v3 Cluster
- 5. CMAQv5.4+ Benchmark on HBv3_120 compute nodes and beeond
- 5.1. Use CycleCloud with CMAQv5.4+ software and 12US1 Benchmark data.
- 5.2. Log into the new cluster
- 5.3. Download the input data from the AWS Open Data CMAS Data Warehouse using the aws copy command.
- 5.4. Verify Input Data
- 5.5. Install CMAQv5.4+
- 5.6. Copy and Examine CMAQ Run Scripts
- 5.7. Submit Job to Slurm Queue to run CMAQ on beeond
- 5.8. Submit job to run on 1 node x 96 processors
- 5.9. Submit job to run on 3 nodes
- 5.10. Check how quickly the processing is being completed
- 5.11. Check results when job has completed successfully
- 5.12. Check to see if spot VMs are available
- 5.13. Unsuccessful slurm status messages
- 5.14. Change to HB176_v4 compute node
- 5.15. To recover from failure use the terminate cluster option
- 5.16. If SLURM jobs are in a bad state
- 5.17. Run DESID CMAQ on hbv3_120 using the beeond filesystem
- 6. Scripts to run combine and post processing
- 7. Scripts to post-process CMAQ output
- 8. Install R, Rscript and Packages
- 9. Install Anaconda on the /shared/build directory
- 10. QA CMAQ
- 11. Compare Timing of CMAQ Routines
- 12. Copy Output to S3 Bucket
- 13. Logout and Delete CycleCloud
- 14. Performance Optimization
- 14.1. Right-sizing Compute Nodes for a Single Virtual Machine.
- 14.2. An explanation of why a scaling analysis is required for Single Node
- 14.3. Benchmark Scaling Plots using Single Virtual Machine HBv120
- 14.4. Right-sizing Compute Nodes for the CycleCloud
- 14.5. An explanation of why a scaling analysis is required for Multinode or Parallel MPI Codes
- 14.6. Slurm Compute Node Provisioning
- 14.7. Benchmark Scaling Plots using CycleCloud
- 15. Additional Resources
- 16. Future Work
- 17. Contribute to this Tutorial
- 18. Optional instructions for Creating CycleCloud Cluster using /shared and /lustre