Out of Memory Error

An out-of-memory error occurs when a job consumes more memory than it requested.

  • Slurm checks memory usage on a polling schedule. In the example below, the job exceeded its memory limit and was killed between polls, which is why the job state shows OUT_OF_MEMORY while Memory Utilized reports no memory used.
  • If the researcher does not specify a memory requirement, the job receives a default of 2 GB per core; see the sketch after the example below for requesting memory explicitly. (Additional info at https://oneit.uncc.edu/urc/research-clusters/orion-gpu-slurm-user-notes)

Example:

$ seff <jobid>
Job ID: <jobid>
Cluster: starlight
User/Group: <user>/<group>
State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 12.00 GB (2.00 GB/core)
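
To avoid this error, request enough memory explicitly in the submit script rather than relying on the 2 GB/core default. A minimal sketch, assuming the job needs roughly 16 GB in total (the memory value and the program name are illustrative placeholders):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=16G            # total memory for the job; adjust to your workload
#SBATCH --partition=Orion    # same partition used in the examples on this page

./my_program                 # placeholder for the actual workload

Slurm also accepts --mem-per-cpu if you prefer to scale the request with the number of cores.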



Time Limit Error

A time limit error occurs when a job runs longer than the time requested in the submitted job file.

Consider the following submit script, which requests one minute of wall time but then sleeps for 5 minutes (300 seconds), so the job exceeds its time limit and is cancelled.

Submit Job Script

#!/bin/bash
#SBATCH --time=1:00          # request 1 minute of wall time
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=Orion

sleep 300                    # sleep for 5 minutes, which exceeds the 1-minute limit

Job Output file 

$ more /users/slurm-jobs/examples/date/slurm-1019828.out
slurmstepd: error: *** JOB 1019828 ON str-c139 CANCELLED AT 2021-08-02T09:26:38 DUE TO TIME LIMIT ***
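
To resolve a time limit error, raise the requested wall time in the submit script (for example, #SBATCH --time=10:00 for ten minutes) and compare how long completed jobs actually ran against their limit. A minimal sketch, assuming sacct is available on the cluster and <jobid> is replaced with a real job ID:

$ sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit

The Elapsed and Timelimit columns show how close the job came to its requested limit.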


Insufficient Space in Directory



Slurm jobs may fail when there is no space left in the working directory. Please ensure that the directories your job writes to have enough free disk space.
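
A quick way to check is to look at free space and directory usage from a login node before submitting. A minimal sketch, assuming the job writes to the current working directory (the output path is an illustrative placeholder):

$ df -h .                    # free space on the filesystem holding this directory
$ du -sh ./output            # size of a specific output directory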

If you would like to know how much RAM your compute job is using on the cluster, consider adding the prologue/epilogue scripts to your submit script; they will write additional job information to your job output log.
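
As an alternative to the prologue/epilogue approach described above, Slurm's accounting tools can report memory usage directly. A minimal sketch, assuming <jobid> is replaced with a real job ID (for batch jobs, sstat may need the <jobid>.batch step):

$ sstat -j <jobid> --format=JobID,MaxRSS,AveRSS    # memory of a running job
$ seff <jobid>                                     # efficiency summary after the job finishes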

