CMAQv5.4+ CONUS Benchmark Tutorial using the 12US1 Domain
5.1. Use CycleCloud with CMAQv5.4+ software and 12US1 Benchmark data
Step-by-step instructions for running the CMAQ 12US1 benchmark for 2 days on a CycleCloud cluster, using the BeeOND parallel filesystem for the input data. Note: CPUs are required to create the BeeOND shared filesystem, to copy the data in, and to copy the data out. It is therefore necessary to leave some CPUs available for this work rather than using every CPU in the CMAQ domain decomposition (NPCOL x NPROW). For example, on HB120v3 nodes (120 cores each), the run scripts below request 96 tasks per node, leaving the remaining cores free for this work.
Input files are *.nc (uncompressed netCDF)
5.1.1. Use the files under the following directory to set up the CycleCloud cluster to use BeeOND
These instructions use a modified version of the guidance available in this blog post, updated for CycleCloud version 8.5 and Slurm 22.05.10.
Full BeeOND documentation: BeeGFS On Demand User Manual
Edit the cluster and use the Cloud-init option for your CycleCloud to install the code in the file: beeond-cloud-init-almalinux8.5-HBv3
Do not use the Apply to all option. Select Scheduler and copy and paste the contents of scheduler-cloud-init.sh obtained from GitHub as follows:
wget https://raw.githubusercontent.com/CMASCenter/cyclecloud-cmaq/main/install_scripts/beeond/scheduler-cloud-init.sh
Select hpc and copy and paste the contents of hpc-cloud-init.sh into the shell.
wget https://raw.githubusercontent.com/CMASCenter/cyclecloud-cmaq/main/install_scripts/beeond/hpc-cloud-init.sh
Save this setting, then terminate and restart the cluster.
5.2. Log into the new cluster
To find the IP address of the scheduler, go to the webpage for your Azure CycleCloud clusters: https://IP-address/home
Double-click Scheduler.
Under view details, double-click scheduler, and a pop-up window will appear.
Click on the connect button in the upper right corner.
Copy and paste the login command that is provided. It will have the following syntax:
ssh -Y $USER@IP-address
Make the /shared/build directory and change ownership from root to your account.
sudo mkdir /shared/build
sudo chown $USER /shared/build
Make the /shared/cyclecloud-cmaq directory and change ownership from root to your account.
sudo mkdir /shared/cyclecloud-cmaq
sudo chown $USER /shared/cyclecloud-cmaq
Clone the cyclecloud-cmaq repo
cd /shared
git clone -b main https://github.com/CMASCenter/cyclecloud-cmaq.git cyclecloud-cmaq
Make the /shared/data directory and change ownership to your account
sudo mkdir /shared/data
sudo chown $USER /shared/data
Create the output directory
mkdir -p /shared/data/output
The BeeOND filesystem is created from the 1.8 TB NVMe disks on the compute nodes when the run script is submitted. If you use two nodes, the shared BeeOND filesystem will be 3.5 TB:
beegfs_ondemand 3.5T 103M 3.5T 1% /mnt/beeond
5.3. Download the input data from the AWS Open Data CMAS Data Warehouse using the aws copy command
Install the AWS CLI to obtain data from the AWS S3 bucket
cd /shared/build
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
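To confirm the install worked, check the version. For reference, the copy script in the next step issues unsigned requests of the general form shown below; the bucket and key here are placeholders only, and the real paths are defined inside the script itself:

aws --version
# Illustrative only -- public CMAS Open Data buckets can be read without
# credentials via --no-sign-request; see the script for the actual paths
aws s3 cp --no-sign-request s3://example-bucket/example-path/file.nc /shared/data/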
Download the data
cd /shared/cyclecloud-cmaq/s3_scripts/
./s3_copy_nosign_2018_12US1_conus_cmas_opendata_to_shared_20171222_cb6r5_uncompressed.csh
5.4. Verify Input Data
cd /shared/data/2018_12US1
du -h
Output
40K ./CMAQ_v54+_cb6r5_scripts
44K ./CMAQ_v54+_cracmm_scripts
1.5G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/cmv_c1c2_12
2.3G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/cmv_c3_12
3.3G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/merged_nobeis_norwc
1.1G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/othpt
990M ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/pt_oilgas
4.5M ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptagfire
206M ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptegu
14M ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptfire
1004K ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptfire_grass
944K ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptfire_othna
4.7G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready/ptnonipm
14G ./emis/cb6r3_ae6_20200131_MYR/cmaq_ready
2.9G ./emis/cb6r3_ae6_20200131_MYR/premerged/rwc
2.9G ./emis/cb6r3_ae6_20200131_MYR/premerged
17G ./emis/cb6r3_ae6_20200131_MYR
60K ./emis/emis_dates
17G ./emis
2.2G ./epic
13G ./icbc/CMAQv54_2018_108NHEMI_M3DRY
17G ./icbc
26G ./met/WRFv4.3.3_LTNG_MCIP5.3.3_compressed
26G ./met
3.9G ./misc
697M ./surface
66G .
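As an optional sanity check, confirm that the total size matches the listing above and spot-check that the inputs are uncompressed netCDF. The file utility typically identifies classic netCDF files as “NetCDF Data Format data”; the exact wording may vary by OS version:

du -sh /shared/data/2018_12US1
file `find /shared/data/2018_12US1 -name '*.nc' | head -1`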
5.5. Install CMAQv5.4+
Change directories to install and build the libraries and CMAQ
Install netCDF C and Fortran Libraries
cd /shared/cyclecloud-cmaq
./gcc_netcdf_cluster.csh
cp dot.cshrc ~/.cshrc
Start a csh shell (which sources ~/.cshrc) and list the environment
csh
env
Verify the LD_LIBRARY_PATH environment variable
echo $LD_LIBRARY_PATH
Output
/opt/openmpi-4.1.5/lib:/opt/gcc-9.2.0/lib64:/shared/build/netcdf/lib
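Optionally, query the netCDF build directly; this assumes the install prefix /shared/build/netcdf implied by the library path above:

/shared/build/netcdf/bin/nc-config --version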
Install I/O API Library
cd /shared/cyclecloud-cmaq
./gcc_ioapi_cluster.csh
Build CMAQ
./gcc_cmaqv54+.csh
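If the build succeeds, a BLD directory containing the CCTM executable should appear under the scripts directory. A quick check (the exact BLD directory name depends on the compiler and version strings used by the build script):

ls -d /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/BLD_CCTM_*
ls /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/BLD_CCTM_*/CCTM_*.exe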
5.6. Copy and Examine CMAQ Run Scripts
Obtain copies of the CMAQ run scripts that have been edited to use the /mnt/beeond shared filesystem.
cp /shared/cyclecloud-cmaq/run_scripts/CMAQ_v54+_beeond/run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.*.ncclassic.csh /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/
Note
The time that it takes the 2-day CONUS benchmark to run will vary based on the number of CPUs used, the compute node type, and which disks are used for I/O (/shared, BeeOND, or Lustre). The timings reported below are from the BeeOND filesystem on HB120v3 compute nodes.
Examine how the run script is configured
cd /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/
head -n 30 /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.2x96.ncclassic.csh
#!/bin/csh -f
## For CycleCloud 120pe
## data on /lustre data directory
## https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/LDTWKH
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --exclusive
#SBATCH -J CMAQ
#SBATCH -o /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2days.20171222start.2x96.log
#SBATCH -e /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2days.20171222start.2x96.log
#SBATCH -p hpc
##SBATCH --constraint=BEEOND
###SBATCH --beeond
The “beeond start” command was added to the top of the job script, and “beeond stop” to the end; a sketch of the corresponding stage-out and stop steps follows the code below. Note: the -m 2 option is specific to using two nodes; modify it to match the number of nodes specified in the `#SBATCH --nodes=` line above:
# ===================================================================
#> Start Beeond filesystem
# ===================================================================
beeond start -P -m 2 -n /shared/home/$SLURM_JOB_USER/nodefile-$SLURM_JOB_ID -d /mnt/nvme -c /mnt/beeond -f /etc/beegfs
## Copy files to /mnt/beeond, note, it may take 5 minutes to prepare the /mnt/beeond filesystem and to copy the data
beeond-cp stagein -n ~/nodefile-$SLURM_JOB_ID -g /shared/data/2018_12US1 -l /mnt/beeond/data/2018_12US1
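At the end of the job script, the outputs are staged back out to /shared and the filesystem is stopped. The sketch below mirrors the stagein command above; the output path and stop flags are illustrative, so consult the run script itself and the BeeOND manual for the authoritative versions:

# ===================================================================
#> Copy output from /mnt/beeond back to /shared, then stop Beeond
# ===================================================================
beeond-cp stageout -n ~/nodefile-$SLURM_JOB_ID -g /shared/data/output -l /mnt/beeond/data/output
# -L and -d (delete logs and data) are per the BeeOND manual; verify
# against your BeeOND version before relying on them
beeond stop -n ~/nodefile-$SLURM_JOB_ID -L -d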
Note
In this run script, the SBATCH directives request 2 nodes with 96 PEs each, or 2 x 96 = 192 PEs.
Verify that the NPCOL and NPROW settings in the script match what is requested in the SBATCH commands that tell Slurm how many compute nodes to provision. In this case, to run CMAQ on 192 CPUs (SBATCH --nodes=2 and --ntasks-per-node=96), use NPCOL=16 and NPROW=12. The matching decompositions for the other node counts are listed after the output below.
grep NPCOL run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.2x96.ncclassic.csh
Output:
#> Horizontal domain decomposition
setenv NPCOL_NPROW "1 1"; set NPROCS = 1 # single processor setting
@ NPCOL = 16; @ NPROW = 12
@ NPROCS = $NPCOL * $NPROW
setenv NPCOL_NPROW "$NPCOL $NPROW";
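The other run scripts in this tutorial pair each node count with a matching decomposition; the pairings below are taken from the log-file names used later in this tutorial:

#>  1 node  x 96:  @ NPCOL = 8;  @ NPROW = 12   # 96 PEs  (8x12)
#>  2 nodes x 96:  @ NPCOL = 16; @ NPROW = 12   # 192 PEs (16x12)
#>  3 nodes x 96:  @ NPCOL = 16; @ NPROW = 18   # 288 PEs (16x18)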
Verify that the modules are loaded
module list
Currently Loaded Modulefiles:
1) gcc-9.2.0 2) mpi/openmpi-4.1.5
5.7. Submit job to the Slurm queue to run CMAQ on BeeOND
cd /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.2x96.ncclassic.csh
5.7.1. Check status of run
squeue
Output:
[lizadams@ip-0A0A000A scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 hpc CMAQ lizadams CF 0:01 2 beeondtest2-hpc-[1-2]
It takes about 5-8 minutes for the compute nodes to spin up; once the nodes are available, the status will change from CF (configuring) to R (running).
5.7.2. Successfully started run
squeue
Output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25 hpc CMAQ lizadams R 56:26 2 CycleCloud8-5Beond-hpc-[3-4]
5.7.3. Check that the /mnt/beeond filesystem has been created on the compute nodes
Log into the compute node by getting its IP address. Go to the webpage where you configured the Azure CycleCloud clusters: https://IP-address/home. Double-click hpc to show the view details panel, then double-click one of the hpc compute nodes; in the pop-up window, click connect to obtain the IP address of the compute node.
ssh $USER@IP-address-compute-node
If you are running on 2 compute nodes, each node has a 1.8 TB NVMe drive; BeeOND combines them into a single 3.5 TB filesystem mounted at /mnt/beeond.
df -h
Output:
[lizadams@ip-0A0A000B ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 225G 0 225G 0% /dev
tmpfs 225G 0 225G 0% /dev/shm
tmpfs 225G 18M 225G 1% /run
tmpfs 225G 0 225G 0% /sys/fs/cgroup
/dev/sda2 59G 27G 33G 45% /
/dev/sda1 994M 209M 786M 21% /boot
/dev/sda15 495M 5.9M 489M 2% /boot/efi
/dev/sdb1 472G 216K 448G 1% /mnt
/dev/md10 1.8T 16G 1.8T 1% /mnt/nvme
10.10.0.10:/sched 30G 247M 30G 1% /sched
10.10.0.10:/shared 1000G 95G 906G 10% /shared
tmpfs 45G 0 45G 0% /run/user/20001
beegfs_ondemand 3.5T 31G 3.5T 1% /mnt/beeond
5.7.4. Check on the log file status
grep -i 'Processing completed.' run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2days.20171222start.2x96.log
Output:
Processing completed... 5.8654 seconds
Processing completed... 5.8665 seconds
Processing completed... 5.8610 seconds
Processing completed... 5.8422 seconds
Processing completed... 5.8657 seconds
Processing completed... 5.8616 seconds
Processing completed... 5.8824 seconds
Processing completed... 5.8581 seconds
Processing completed... 5.8653 seconds
Processing completed... 5.8961 seconds
Processing completed... 7.9473 seconds
Processing completed... 5.4089 seconds
Processing completed... 5.8996 seconds
Processing completed... 5.9659 seconds
Processing completed... 5.9462 seconds
Processing completed... 5.8966 seconds
Processing completed... 5.9326 seconds
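To track overall progress while the job is still running, count how many model time steps have reported completion so far (each “Processing completed” line corresponds to one completed step):

grep -c 'Processing completed' run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2days.20171222start.2x96.log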
Once the job has completed the two-day benchmark, check the log file for the timings.
tail -n 30 run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2days.20171222start.2x96.log
Output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 192
All times are in seconds.
Num Day Wall Time
01 2017-12-22 1966.4
02 2017-12-23 2170.2
Total Time = 4136.60
Avg. Time = 2068.30
INFO: Using status information file: /tmp/beeond.tmp
INFO: Checking reachability of host 10.10.0.5
INFO: Checking reachability of host 10.10.0.7
INFO: Unmounting file system on host: 10.10.0.5
sudo: do_stoplocal: command not found
INFO: Unmounting file system on host: 10.10.0.7
sudo: do_stoplocal: command not found
INFO: Stopping remaining processes on host: 10.10.0.5
INFO: Stopping remaining processes on host: 10.10.0.7
INFO: Deleting status file on host: 10.10.0.5
INFO: Deleting status file on host: 10.10.0.7
5.8. Submit job to run on 1 node x 96 processors
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.1x96.ncclassic.csh
Check the result after the job has finished
tail -n 30 run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.96.8x12pe.2days.20171222start.1x96.log
Output
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 96
All times are in seconds.
Num Day Wall Time
01 2017-12-22 3278.9
02 2017-12-23 3800.7
Total Time = 7079.60
Avg. Time = 3539.80
5.9. Submit job to run on 3 nodes
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.3x96.ncclassic.csh
Verify that the size of the BeeOND filesystem is 5.3 TB when using 3 nodes.
ssh $USER@IP-address-compute-node
df -h
Output:
df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 225G 0 225G 0% /dev
tmpfs 225G 789M 224G 1% /dev/shm
tmpfs 225G 18M 225G 1% /run
tmpfs 225G 0 225G 0% /sys/fs/cgroup
/dev/sda2 59G 27G 33G 45% /
/dev/sda1 994M 209M 786M 21% /boot
/dev/sda15 495M 5.9M 489M 2% /boot/efi
/dev/sdb1 472G 216K 448G 1% /mnt
/dev/md10 1.8T 39G 1.8T 3% /mnt/nvme
10.10.0.10:/sched 30G 247M 30G 1% /sched
10.10.0.10:/shared 1000G 223G 778G 23% /shared
tmpfs 45G 0 45G 0% /run/user/20001
beegfs_ondemand 5.3T 116G 5.2T 3% /mnt/beeond
5.10. Check how quickly the processing is being completed
grep -i 'Processing completed' run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.288.16x18pe.2days.20171222start.3x96.log
Output
Processing completed... 7.4152 seconds
Processing completed... 5.7284 seconds
Processing completed... 5.6439 seconds
Processing completed... 5.5742 seconds
Processing completed... 5.6011 seconds
Processing completed... 5.5687 seconds
Processing completed... 5.5505 seconds
Processing completed... 5.5686 seconds
Processing completed... 5.5193 seconds
Processing completed... 5.5192 seconds
Processing completed... 5.4985 seconds
Processing completed... 6.7259 seconds
Processing completed... 6.3606 seconds
Processing completed... 5.5312 seconds
5.11. Check results when job has completed successfully
tail -n 30 run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.288.16x18pe.2days.20171222start.3x96.log
Output
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 288
All times are in seconds.
Num Day Wall Time
01 2017-12-22 1588.4
02 2017-12-23 1721.8
Total Time = 3310.20
Avg. Time = 1655.10
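Comparing the three timing reports gives the scaling across node counts (total times: 7079.60 s on 96 PEs, 4136.60 s on 192 PEs, 3310.20 s on 288 PEs). A quick check of the speedups relative to the single-node run:

echo "scale=2; 7079.6/4136.6" | bc   # 2 nodes vs 1: ~1.71x
echo "scale=2; 7079.6/3310.2" | bc   # 3 nodes vs 1: ~2.13x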
5.12. Check to see if spot VMs are available
5.13. Unsuccessful Slurm status messages
The NODELIST reason “Nodes required for job are DOWN…” will be generated if a batch job is submitted before the previous job has successfully terminated its nodes.
Wait 5-10 minutes and see if the status changes from PD (pending) to CF (configuring).
squeue
Output
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 hpc CMAQ lizadams PD 0:00 3 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
The NODELIST reason “launch failed requeued held” requires that the job be canceled. Note: if you get this message, the HPC compute nodes may stay up and incur charges without running the job, so it is important to cancel the job using scancel.
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 hpc CMAQ lizadams PD 0:00 3 (launch failed requeued held)
scancel 3
Confirm that the HPC VMs are deleted by viewing the CycleCloud webpage.
5.14. Change to HB176_v4 compute node
Terminate the cluster
Edit the cluster configuration
Select HB176_v4 for the HPC compute nodes
Start the cluster
Submit the following run script
cd /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.1x176.ncclassic.beeond.csh
Log into the compute node to verify that the BeeOND filesystem was created
df -h
Output:
df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 378G 0 378G 0% /dev
tmpfs 378G 0 378G 0% /dev/shm
tmpfs 378G 18M 378G 1% /run
tmpfs 378G 0 378G 0% /sys/fs/cgroup
/dev/sda2 59G 27G 33G 45% /
/dev/sda1 994M 209M 786M 21% /boot
/dev/sda15 495M 5.9M 489M 2% /boot/efi
/dev/sdb1 472G 216K 448G 1% /mnt
/dev/md10 3.5T 28G 3.5T 1% /mnt/nvme
10.10.0.10:/sched 30G 247M 30G 1% /sched
10.10.0.10:/shared 1000G 421G 580G 43% /shared
tmpfs 76G 0 76G 0% /run/user/20001
beegfs_ondemand 3.5T 28G 3.5T 1% /mnt/beeond
Note: some of these instructions do not work, as azslurm is not found on the AlmaLinux8 OS. Additional instructions are available here: Azure CycleCloud 8 help for Slurm
5.15. To recover from failure, use the terminate cluster option
If the job does not begin to configure, you may need to terminate and then restart the cluster.
The terminate option does not delete the software, it only shuts down the scheduler and compute nodes.
The terminate option is equivalent to stopping the cluster. Once it has been stopped, the cluster can be restarted using the Start button.
5.16. If Slurm jobs are in a bad state
When a job fails, the compute nodes may be left in an unusable Slurm state. You can try to reset them with scontrol like so:
sudo scontrol update nodename=hpc-[1-2] state=resume
If that doesn’t reset them, you can try the CycleCloud command to shut down the nodes (suspend):
sudo -i azslurm suspend --node-list hpc-[1-2]
Likewise, you can start nodes that are in a “bad” Slurm state like this:
sudo -i azslurm resume --node-list hpc-[1-2]
In all the cases above replace hpc-[1-2] with your specific node list.
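Before suspending or resuming nodes, it can help to list their current Slurm states and the reasons they were marked down (standard sinfo format flags; see the sinfo man page):

sinfo -N -o "%N %t %E"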