18.1.5. Copy the run scripts from the CycleCloud repo to run on HBv120#
Note, the run scripts are tailored to the Compute Node. This assumes the cluster was built with HC44rs compute nodes.
Change directories to where the run scripts are available from the git repo.
cd /shared/data/2018_12US1/CMAQ_v54+_cb6r5_scripts
Copy the run scripts to the run directory
cp *.csh /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/`
18.1.6. Edit the run script to run on 192 pes#
cd /shared/build/openmpi_gcc/CMAQ_v54/CCTM/scripts/
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.2x96.ncclassic.csh
Note, it will take about 3-5 minutes for the compute notes to start up This is reflected in the Status (ST) of PD (pending), with the NODELIST reason being that it is configuring the partitions for the cluster
18.1.7. Check the status in the queue#
squeue
output:
[lizadams@CMAQSlurmHC44rsAlmaLinux-scheduler scripts]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 hpc CMAQ lizadams CF 0:02 2 CycleCloud8-5-hpc-[1-2]
After 5 minutes the status will change once the compute nodes have been created and the job is running
squeue
output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6 hpc CMAQ lizadams R 0:37 2 CycleCloud8-5-hpc-[1-2]
The 192 pe job should take 85 minutes to run (42 minutes per day)
Note, if the job does not get scheduled, examine the slurm logs
sudo vi /var/log/slurmctld/slurmctld.log
sudo vi //var/log/slurmctld/resume.log
18.1.8. check the timings while the job is still running using the following command#
cd /shared/data/output/output_v54+_cb6r5_ae7_aq_WR413_MYR_gcc_2018_12US1_2x96_classic_shared
grep 'Processing completed' CTM_LOG_001*
output:
Processing completed... 27.3354 seconds
Processing completed... 5.3785 seconds
Processing completed... 5.4735 seconds
Processing completed... 5.4295 seconds
Processing completed... 5.4600 seconds
Processing completed... 5.5127 seconds
Processing completed... 5.4453 seconds
Processing completed... 5.4866 seconds
Processing completed... 5.4551 seconds
Processing completed... 5.4535 seconds
Processing completed... 5.4729 seconds
Processing completed... 7.7710 seconds
18.1.9. When the job has completed, use tail to view the timing from the log file.#
tail -n 30 /shared/build/openmpi_gcc/CMAQ_v54+_classic/CCTM/scripts/run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.192.16x12pe.2day.20171222start.2x96.shared.log
output:
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 192
All times are in seconds.
Num Day Wall Time
01 2017-12-22 2549.7
02 2017-12-23 2752.4
Total Time = 5302.10
Avg. Time = 2651.05
18.1.10. Check whether the scheduler thinks there are cpus or vcpus#
sinfo -lN
output:
Wed Jan 10 19:24:35 2024
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
cyclecloudlizadams-hpc-pg0-1 1 hpc* allocated 96 96:1:1 443596 0 1 cloud none
cyclecloudlizadams-hpc-pg0-2 1 hpc* idle~ 96 96:1:1 443596 0 1 cloud none
cyclecloudlizadams-hpc-pg0-3 1 hpc* idle~ 96 96:1:1 443596 0 1 cloud none
cyclecloudlizadams-hpc-pg0-4 1 hpc* idle~ 96 96:1:1 443596 0 1 cloud none
cyclecloudlizadams-hpc-pg1-1 1 hpc* idle~ 96 96:1:1 443596 0 1 cloud none
cyclecloudlizadams-htc-1 1 htc idle~ 2 2:1:1 3072 0 1 cloud none
cyclecloudlizadams-htc-2 1 htc idle~ 2 2:1:1 3072 0 1 cloud none
cyclecloudlizadams-htc-3 1 htc idle~ 2 2:1:1 3072 0 1 cloud none
cyclecloudlizadams-htc-4 1 htc idle~ 2 2:1:1 3072 0 1 cloud none
cyclecloudlizadams-htc-5 1 htc idle~ 2 2:1:1 3072 0 1 cloud none
18.1.11. Edit the run script to run on 96 pes#
sbatch run_cctm_2018_12US1_v54_cb6r5_ae6.20171222.1x96.ncclassic.retest.shared.csh
18.1.12. Check the timing after run completed#
tail -n 30 run_cctm5.4+_Bench_2018_12US1_cb6r5_ae6_20200131_MYR.96.8x12pe.2day.20171222start.1x96.shared.log
Output
==================================
***** CMAQ TIMING REPORT *****
==================================
Start Day: 2017-12-22
End Day: 2017-12-23
Number of Simulation Days: 2
Domain Name: 12US1
Number of Grid Cells: 4803435 (ROW x COL x LAY)
Number of Layers: 35
Number of Processes: 96
All times are in seconds.
Num Day Wall Time
01 2017-12-22 3744.5
02 2017-12-23 4184.8
Total Time = 7929.30
Avg. Time = 3964.65