Warning
This documentation is under continuous development
Overview
This document provides tutorials and information on using the Microsoft Azure Portal to create either a single Virtual Machine or a CycleCloud cluster to run CMAQ. The tutorials are aimed at users with cloud computing experience who are already familiar with Azure. For those with no cloud computing experience, we recommend reviewing the Additional Resources listed in chapter 13 of this document.
Format of this documentation
This document provides three hands-on tutorials that are designed to be read in order. The Introductory Tutorial walks you through setting up an Azure account and logging in to the Azure Portal website. You will learn how to set up your Azure Resource ID, configure and create a demo virtual machine, and then exit and delete the virtual machine and all of the resources associated with it by deleting the resource group. The Intermediate Tutorial steps you through running a CMAQ test case on a single Virtual Machine, with instructions to install CMAQ, libraries, and input data. The Advanced Tutorial explains how to create a CycleCloud (High Performance Computing) cluster for larger compute jobs and install CMAQ, the required libraries, and input data. The remaining sections provide instructions on post-processing CMAQ output, comparing output and runtimes from multiple simulations, and copying output from CycleCloud to an Amazon Web Services (AWS) Simple Storage Service (S3) bucket.
Azure Subscriptions
The ability to use resources available in the Microsoft Azure Cloud is limited by quotas that are set at the subscription level. This tutorial was developed using UNC Chapel Hill’s Enterprise account. Additional effort is being made to identify how to use a pay-as-you-go account, but these instructions have not been finalized. There may also be differences in how managed identities and user-level permissions are set by the administrator of your enterprise-level account that are not covered in this tutorial.
Why might I need to use Azure Virtual Machine or CycleCloud?
An Azure Virtual Machine may be configured to run code compiled with Message Passing Interface (MPI) on a single high performance compute node. The Intermediate Tutorial demonstrates how to run CMAQ interactively with OpenMPI on multiple CPUs of a single virtual machine.
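As a rough illustration of what running on multiple CPUs involves: CMAQ run scripts split the modeling domain across MPI processes using two settings, NPCOL and NPROW, whose product is the number of processes. The Python sketch below picks a near-square pair for a given process count. The function name `near_square_decomposition` is hypothetical, and the near-square heuristic is an assumption made for illustration; the tutorial's run scripts may use a different layout.

```python
import math


def near_square_decomposition(nprocs: int) -> tuple[int, int]:
    """Return (npcol, nprow) with npcol * nprow == nprocs and npcol >= nprow.

    Hypothetical helper: chooses the factor pair closest to a square grid.
    """
    if nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    # Search downward from the integer square root for the largest divisor.
    for nprow in range(math.isqrt(nprocs), 0, -1):
        if nprocs % nprow == 0:
            return nprocs // nprow, nprow
    raise AssertionError("unreachable: 1 divides every positive integer")


if __name__ == "__main__":
    # Assumption: 120 MPI processes, one per core of a 120-core virtual machine.
    npcol, nprow = near_square_decomposition(120)
    print(f"NPCOL={npcol} NPROW={nprow}")
```

The returned pair could then be copied into the NPCOL and NPROW settings of a run script; whether a near-square layout is the fastest choice for a given domain would need to be confirmed by benchmarking.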
Azure CycleCloud may be configured to be the equivalent of a High Performance Computing (HPC) environment. This includes using job schedulers such as Slurm, running on multiple nodes/virtual machines using code compiled with Message Passing Interface (MPI), and reading and writing output to a high performance, low latency shared disk. The advantage of using the Slurm scheduler is that the number of compute nodes that will be provisioned can be adjusted to meet the requirements of a given simulation. In addition, the user can reduce costs by using Spot instances rather than On-Demand instances for the compute nodes. CycleCloud also supports submitting multiple jobs to the job submission queue.
Our goal is to make this user guide to running CMAQ on either a single Virtual Machine or the CycleCloud Cluster as helpful and user-friendly as possible. Any feedback is both welcome and appreciated.
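To make the node-provisioning point concrete, the following Python sketch estimates how many compute nodes a job needs for a requested number of processing elements (pes). The function name `nodes_needed` is hypothetical, and the figure of 120 cores per node assumes compute nodes of the HB120rs_v2 size named in the Intermediate Tutorial; a cluster built from a different virtual machine size would use a different value.

```python
def nodes_needed(total_pes: int, cores_per_node: int) -> int:
    """Return the smallest number of nodes that can host total_pes MPI ranks.

    Hypothetical helper: assumes one MPI rank per core and identical nodes.
    """
    if total_pes < 1 or cores_per_node < 1:
        raise ValueError("total_pes and cores_per_node must be positive")
    # Ceiling division using integers only.
    return (total_pes + cores_per_node - 1) // cores_per_node


if __name__ == "__main__":
    # Assumption: 120 cores per compute node (HB120rs_v2-sized nodes).
    for pes in (120, 180, 360):
        print(f"{pes} pes -> {nodes_needed(pes, 120)} node(s)")
```

Under these assumptions, a 180-pe job does not fit on one node and would occupy two, leaving some cores idle on the second node.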
Additional information on Azure CycleCloud:
Contents:
- 1. Introductory Tutorial
- 2. System Requirements
- 3. Intermediate Tutorial
- 3.1. Create a HB120rs_v2 Virtual Machine
- 3.2. Log in to the Virtual Machine
- 3.3. Mount the disk on the server as /shared using the instructions on the following link:
- 3.4. Alternatively, you can create an NVMe striped disk that has faster performance.
- 3.5. Obtain the Cyclecloud-cmaq code from GitHub
- 3.6. Change shell to use tcsh
- 3.7. Create Environment Module for Libraries
- 3.8. Install and Build CMAQ
- 3.9. Copy the run scripts from the repo to the run directory
- 3.10. Download the Input data from the S3 Bucket
- 3.11. Run CMAQ interactively using the following command:
- 4. Advanced Tutorial
- 4.1. Create CycleCloud CMAQ Cluster
- 4.2. Modify CycleCloud CMAQ Cluster
- 4.3. Install CMAQ and prerequisite libraries on Linux
- 4.4. Configuring selected storage and obtaining input data
- 4.5. Copy the run scripts from the CycleCloud repo
- 4.6. Run the CONUS Domain on 180 pes
- 4.7. Check the status in the queue
- 4.8. Check the timings while the job is still running using the following command
- 4.9. When the job has completed, use tail to view the timing from the log file.
- 4.10. Check whether the scheduler thinks there are CPUs or vCPUs
- 5. Scripts to run combine and post processing
- 6. Scripts to post-process CMAQ output
- 7. Install R, Rscript and Packages
- 8. QA CMAQ
- 9. Compare Timing of CMAQ Routines
- 10. Copy Output to S3 Bucket
- 11. Log out and Delete CycleCloud
- 12. Performance Optimization
- 12.1. Right-sizing Compute Nodes for a Single Virtual Machine
- 12.2. An explanation of why a scaling analysis is required for Single Node
- 12.3. Benchmark Scaling Plots using Single Virtual Machine HB120rs_v2
- 12.4. Right-sizing Compute Nodes for the CycleCloud
- 12.5. An explanation of why a scaling analysis is required for Multinode or Parallel MPI Codes
- 12.6. Slurm Compute Node Provisioning
- 12.7. Benchmark Scaling Plots using CycleCloud
- 13. Additional Resources
- 14. Future Work
- 15. Contribute to this Tutorial