

A list of useful SLURM commands
source link: https://gist.github.com/TysonRayJones/34ebca7056cadc60c32dd3d138388a14

In addition to Harvard's fantastic list, we list some other convenient SLURM commands.
Delay an enqueued job from running
Useful for letting other enqueued jobs run without having to kill/re-run already running jobs. To delay for 7 days:
scontrol update JobID=<JOB ID> StartTime=now+7days
The NODELIST(REASON) field reported by squeue for the delayed jobs will become (BeginTime).
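As a minimal sketch of the step above, the delay and the follow-up check can be wrapped in one helper. The name delay_job is hypothetical (it is not a SLURM command); the function only calls scontrol update and squeue with standard format specifiers, and it assumes both are on your PATH.

```shell
# Hypothetical helper (not part of SLURM): delay a pending job, then show
# the reason squeue reports for it.
# Usage: delay_job <JOB ID> <DELAY>      e.g. delay_job 1234567 7days
delay_job() (
    jobid=$1
    delay=$2
    if [ -z "$jobid" ] || [ -z "$delay" ]; then
        echo "usage: delay_job <JOB ID> <DELAY>" >&2
        exit 2
    fi
    # Same scontrol call as above, with the delay passed in as an argument.
    scontrol update JobID="$jobid" StartTime="now+$delay" || exit 1
    # %i = job ID, %r = reason, %S = actual or expected start time.
    squeue -j "$jobid" -o "%i %r %S"
)
```

Calling delay_job 1234567 7days should then print a line for the job whose reason column reads BeginTime.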
Requeue and immediately delay running jobs
when suspend and hold don't seem to do anything!
You may want to stop running jobs and requeue them further down the queue (i.e. avoid immediately re-running them). This is useful for freeing up nodes to let other jobs run without having to resubmit your running jobs.
To requeue a job and delay it for one day:
(export jobid=<JOB ID>; scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day)
If you have many jobs (with unique job IDs), you'll want to type out a list of jobs to requeue and delay using a for loop:
for jobid in <SPACE SEPARATED LIST OF JOB IDS>; do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day; done
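If typing the list by hand is tedious, one possible variation is to let squeue produce it. The sketch below is an assumption-laden illustration rather than the author's method: requeue_and_delay and my_running_jobs are hypothetical names, and the second function selects every running job you own, so check its output before feeding it to the first.

```shell
# Hypothetical helper: requeue each job ID given as an argument and delay
# all of them by the same amount.
# Usage: requeue_and_delay <DELAY> <JOB ID>...
requeue_and_delay() (
    delay=$1
    shift
    for jobid in "$@"; do
        # Only push the start time back if the requeue itself succeeded.
        if scontrol requeue "$jobid"; then
            scontrol update JobID="$jobid" StartTime="now+$delay"
        else
            echo "could not requeue $jobid" >&2
        fi
    done
)

# Hypothetical helper: print the IDs of your own running jobs, one per line.
# -h drops the header, -t filters by state, %i prints the job ID.
my_running_jobs() {
    squeue -u "${USER:-$(id -un)}" -t RUNNING -h -o "%i"
}
```

Run my_running_jobs on its own first; if the list is what you expect, requeue_and_delay 1day $(my_running_jobs) applies the same requeue and delay to each of them.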
If many of your jobs share a common prefix which you don't want to retype, export it!
(export prefix=<COMMON JOB ID PREFIX>; for suffix in <SPACE SEPARATED LIST OF JOB ID SUFFIXES>; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+1day; done)
For example, given the following list of job IDs...
1234567_10
1234567_11
1234567_12
1234567_13
1234567_14
if you want to requeue + delay jobs 1234567_11 and 1234567_12 for 2 days, you'd call
(export prefix=1234567_1; for suffix in 1 2; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+2days; done)
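Because a mistyped prefix or suffix would requeue the wrong job, it can help to preview the commands before running them. The following is a sketch only: expand_job_ids and preview_requeue are hypothetical helpers that print what the loop above would execute, without calling scontrol at all.

```shell
# Hypothetical helper: print one full job ID per line from a shared prefix
# and a list of suffixes.
# Usage: expand_job_ids <PREFIX> <SUFFIX>...
expand_job_ids() (
    prefix=$1
    shift
    for suffix in "$@"; do
        printf '%s%s\n' "$prefix" "$suffix"
    done
)

# Hypothetical helper: print (but do not run) the requeue + delay commands.
# Usage: preview_requeue <DELAY> <PREFIX> <SUFFIX>...
preview_requeue() (
    delay=$1
    shift
    for jobid in $(expand_job_ids "$@"); do
        echo "scontrol requeue $jobid"
        echo "scontrol update JobID=$jobid StartTime=now+$delay"
    done
)
```

For the example above, preview_requeue 2days 1234567_1 1 2 prints the four scontrol commands for jobs 1234567_11 and 1234567_12, which you can inspect before running the real loop.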
Note that SLURM will often not list the re-queued jobs in squeue, but rest assured, they're still enqueued!
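If you would rather confirm this than take it on faith, one option is to ask scontrol about the job directly. This is a sketch under the assumption that your SLURM version prints JobState, Reason and StartTime fields in scontrol show job output; job_status is a hypothetical name.

```shell
# Hypothetical helper: print the state, pending reason and start time of a
# job, one KEY=VALUE pair per line.
# Usage: job_status <JOB ID>
job_status() {
    # scontrol show job prints space-separated KEY=VALUE pairs; put each
    # pair on its own line and keep only the three of interest.
    scontrol show job "$1" | tr -s ' ' '\n' | grep -E '^(JobState|Reason|StartTime)='
}
```

A requeued and delayed job would be expected to show a pending state and a start time in the future.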
Take care to ensure your jobs have everything they need (e.g. files) when they're eventually re-run.
Keep in mind re-queued jobs may behave differently when re-run. Think carefully e.g. about your random seeding!
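As one illustration of the seeding point, a batch script can make its seed depend on how often the job has been restarted. This sketch assumes SLURM sets the SLURM_RESTART_COUNT environment variable for requeued jobs (it is typically unset on a first run); BASE_SEED and pick_seed are hypothetical names chosen for the example.

```shell
# Hypothetical helper for use inside a batch script: derive a seed from a
# base value and the number of times this job has been restarted or requeued.
pick_seed() (
    base=${BASE_SEED:-12345}             # BASE_SEED: assumed, user-chosen variable
    restarts=${SLURM_RESTART_COUNT:-0}   # treated as 0 when the variable is unset
    echo $((base + restarts))
)
```

With this choice each re-run draws a different seed; if you instead want a re-run to reproduce the original run exactly, drop the restart term and use the base seed alone.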