Setting up a single server SLURM cluster
Shared computation resources easily get crowded as everyone logs on and starts their jobs. Running above full load causes excess context switching and, in the worst case, swapping, both resulting in sub-optimal performance. We would benefit from running the tasks in a more orderly fashion, and coordinating this is a job that can be left to the machine itself.
In this post, I’ll describe how to set up a single-node SLURM mini-cluster to implement such a queue system on a computation server. I’ll assume that there is only one node, albeit with several processors. The computation server we currently use is a 4-way octocore E5-4627v2 3.3 GHz Dell PowerEdge M820 with 512 GiB RAM.
The setup will be described in the usual “read-along” style, with commands to follow. And yes, in case you wondered, you’ll need sudo access to the machine!
Setting up Control Groups
We don’t have a separate login node, as is usual on supercomputers, so the entire machine is still usable directly from an SSH command line. To discourage this, we need to take away the ability to get unfettered access to the machine.
Cgroups (control groups) are a neat Linux feature for controlling process resource usage, but the userspace tools are not installed by default on an Ubuntu Trusty system, so we start out by installing the packages that provide them. We also add the ability to restrict memory usage, which requires a kernel command-line option on Debian derivatives.
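Something along these lines should do it; the package names are the ones I recall from Trusty (they were later renamed to cgroup-tools), so treat this as a sketch:

```
# userspace tools for cgroups; renamed to cgroup-tools in later releases
sudo apt-get install -y cgroup-bin cgroup-lite

# enable the memory controller on the kernel command line, then rebuild GRUB's config
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 cgroup_enable=memory swapaccount=1"/' /etc/default/grub
sudo update-grub
```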
A quick starter for those who haven’t used cgroups before: a controller is a resource that can be restricted, such as CPU or memory, whereas a cgroup is a policy for how some set of processes may use these resources. The controllers are built-in; the cgroups are set up by administrators.
We’ll set up two cgroups: a root cgroup, which has unrestricted access to the machine as before, and an “interactive” cgroup, which we’ll limit to half of a single CPU and 2 GiB of RAM, more or less equivalent to an old workstation. (Note that comments in this config file must start at the very beginning of the line.)
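A minimal sketch of /etc/cgconfig.conf with those limits; the root cgroup is simply the top of the hierarchy and needs no entry of its own:

```
sudo tee /etc/cgconfig.conf > /dev/null <<'EOF'
# comments must start in the first column
group interactive {
    cpu {
        # 50 ms of runtime per 100 ms period, i.e. half of one core
        cpu.cfs_period_us = 100000;
        cpu.cfs_quota_us = 50000;
    }
    memory {
        # 2 GiB
        memory.limit_in_bytes = 2147483648;
    }
}
EOF
```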
Now, Ubuntu 12.04 has a cgconfig daemon that takes care of setting up cgroups, and Ubuntu 16.04 will use systemd for the same purpose, but 14.04 falls in between and has no default way of setting up custom cgroups (!). We’ll solve this by adding an Upstart script which does it based on the configuration file:
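Roughly like this; the job name and the start condition are my own choices, and the actual work is done by cgconfigparser from the cgroup-bin package:

```
sudo tee /etc/init/cgconfig-custom.conf > /dev/null <<'EOF'
description "Apply custom cgroup configuration"
# cgroup-lite is the Upstart job that mounts the cgroup hierarchies on 14.04
start on started cgroup-lite
task
exec /usr/sbin/cgconfigparser -l /etc/cgconfig.conf
EOF
```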
Now that we have our cgroups, the next task is to set up how processes are assigned to them based on the user. Here we let the root user (which starts all the daemons) and the slurm user (which will be set up later, and which starts jobs from the queue) get the root cgroup. Everyone else (i.e. anyone who has logged in with SSH) is supposed to be put in the “interactive” group.
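These rules go in /etc/cgrules.conf; a sketch matching the description above:

```
sudo tee /etc/cgrules.conf > /dev/null <<'EOF'
# user   controllers   destination
root     cpu,memory    /
slurm    cpu,memory    /
*        cpu,memory    /interactive
EOF
```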
But there is still nothing that actually performs these assignments. We solve this by installing a PAM plugin, which moves the login process of anyone SSHing into the system into the right cgroup:
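Assuming the libpam-cgroup package, whose pam_cgroup module consults /etc/cgrules.conf:

```
sudo apt-get install -y libpam-cgroup
echo "session optional pam_cgroup.so" | sudo tee -a /etc/pam.d/sshd
```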
Picking Passwords
SLURM will store the accounting information in a database. Unfortunately, Postgres-based accounting is not mature yet, SQLite-based accounting is non-existent, and file-based accounting does not work properly and is deprecated. Thus, we are stuck with using MySQL, which I have to admit is not my favorite thing to have running on the server.
Anyway, we’ll try to lock it down as well as we can, and in that process we’ll need two passwords: one to access the meta-database in MySQL itself, and one to access the accounting database. These passwords will be written down in (locked-down) configuration files (!), so since we don’t have to memorize them, we can generate something more secure than usual:
Since we need these passwords in two places, once in MySQL itself and once in the configuration files, I’ll put them in environment variables for the duration of the setup.
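For instance like this; the variable names are just my placeholders:

```
export MYSQLROOT_PW=$(openssl rand -base64 18)
export SLURMDB_PW=$(openssl rand -base64 18)
```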
Setting up MySQL
First we set up the database daemon itself and make sure that it only listens for local connections:
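On Trusty the packaged my.cnf already binds to the loopback interface, so it should be enough to install and verify:

```
sudo apt-get install -y mysql-server
grep '^bind-address' /etc/mysql/my.cnf    # expect: bind-address = 127.0.0.1
```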
Next, we’ll enable the root user to access the database without typing the password (if someone has access to the root user’s account or files, you have bigger problems than this anyway):
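A client configuration file in root’s home directory does the trick (using the placeholder variable from above):

```
sudo tee /root/.my.cnf > /dev/null <<EOF
[client]
user=root
password=${MYSQLROOT_PW}
EOF
sudo chmod 600 /root/.my.cnf
```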
Set the root user password in MySQL, allow only local connections and remove the anonymous guest user and the test database:
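Roughly what mysql_secure_installation does, scripted; this assumes the fresh install left the root password empty (otherwise give the old one to mysqladmin with -p), and -H makes the client pick up /root/.my.cnf:

```
mysqladmin -u root password "$MYSQLROOT_PW"
sudo -H mysql <<'EOF'
DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');
DELETE FROM mysql.user WHERE User='';
DROP DATABASE IF EXISTS test;
FLUSH PRIVILEGES;
EOF
```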
Now that we have a satisfactory MySQL instance, let’s create the database for SLURM accounting:
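The database and user names here (slurm_acct_db, slurm) are the conventional ones; whatever you pick must match slurmdbd.conf below:

```
sudo -H mysql <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY '${SLURMDB_PW}';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF
```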
Setting up the SLURM daemons
The SLURM setup consists of four daemons: munge, which authenticates users to the cluster; slurmdbd, which does the authorization, i.e. checks what access the user has to the cluster; slurmctld, which accepts requests to add things to the queue; and slurmd, which actually launches the tasks on each computation node.
Let us start with the authentication daemon. Notice that munge does not consider the default logging setup secure (!), so we must switch it to syslog before it will start.
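On Debian/Ubuntu the init script reads OPTIONS from /etc/default/munge; on a single-node cluster the key only needs to exist and be private (the package may already have created one):

```
sudo apt-get install -y munge
echo 'OPTIONS="--syslog"' | sudo tee /etc/default/munge

sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1k count=1
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```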
A problem with Ubuntu 14.04 is that daemons started manually from SysV-init scripts inherit the cgroup of the console that started them, so we cannot start the daemons ourselves but must leave that to the boot process. By installing first and then writing the configuration files, we stall the startup of the daemons until we have everything ready.
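The SLURM package names on Trusty are, as far as I remember, these (the torque wrapper is only needed for the compatibility test at the end):

```
sudo apt-get install -y slurm-llnl slurm-llnl-slurmdbd slurm-llnl-torque
```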
Configuration
Basically, the configuration file for the authorization daemon just specifies how to connect to the database (we’ll add entries later):
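A sketch of /etc/slurm-llnl/slurmdbd.conf, reusing the database name, user and password from the MySQL step; the log and pid paths are my assumptions based on the Trusty package layout:

```
sudo tee /etc/slurm-llnl/slurmdbd.conf > /dev/null <<EOF
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=${SLURMDB_PW}
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
EOF
sudo chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
sudo chmod 600 /etc/slurm-llnl/slurmdbd.conf
```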
The main configuration file, slurm.conf, is where we set up how the cluster should behave. Notable options here are proctrack/cgroup, which makes SLURM put each job in its own cgroup, and sched/backfill, which lets the scheduler start lower-priority jobs early as long as they do not delay higher-priority ones, keeping the machine fully utilized.
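A sketch of the core of /etc/slurm-llnl/slurm.conf; the host name "hpc" is made up, so substitute the real one. The cgroup process tracking also wants a small cgroup.conf:

```
sudo tee /etc/slurm-llnl/slurm.conf > /dev/null <<'EOF'
ClusterName=hpc
ControlMachine=hpc
AuthType=auth/munge
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
EOF

sudo tee /etc/slurm-llnl/cgroup.conf > /dev/null <<'EOF'
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
EOF
```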
Queues in SLURM are called “partitions”. A set of nodes can be shared between partitions, or they can belong exclusively to one. Each partition has a priority in case a node is covered by more than one.
We’ll set up three partitions, all of which contain the one node that is the computation server.
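For example like this, appended to slurm.conf; the partition names, time limits and priorities are placeholders of my own, while the node figures roughly correspond to the 4x8-core, 512 GiB machine:

```
sudo tee -a /etc/slurm-llnl/slurm.conf > /dev/null <<'EOF'
NodeName=hpc CPUs=32 Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=515000 State=UNKNOWN
PartitionName=express Nodes=hpc Priority=100 MaxTime=1:00:00    State=UP
PartitionName=normal  Nodes=hpc Priority=50  MaxTime=7-00:00:00 State=UP Default=YES
PartitionName=standby Nodes=hpc Priority=10  MaxTime=INFINITE   State=UP
EOF
```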
Limitations to the cluster in SLURM are set up using “associations”, which are tuples of (cluster, account, user, partition) with some properties attached. When we add an association, we say that the user can submit a job to this queue (partition), charging the CPU time to that project (account). Use sacctmgr show assoc format=account,user,partition to see the current list.
Unfortunately, associations are hierarchical in the order listed above, meaning that we cannot grant access to a partition for an entire account, but must do it per user. From version 15.08 onwards we should be able to use an orthogonal “quality of service” property to specify the priorities, but in the current version 2.6 that mechanism can only preempt jobs by cancelling them, not by putting them on standby. Thus, for now, we set PreemptType to partition_prio, which means that the priority of the queue is used.
To have the cluster only accept submissions to queues that have been explicitly allowed, we set AccountingStorageEnforce=limits.
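Appended to slurm.conf; the PreemptMode value is my own assumption for getting low-priority jobs suspended rather than killed:

```
sudo tee -a /etc/slurm-llnl/slurm.conf > /dev/null <<'EOF'
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
AccountingStorageEnforce=limits
EOF
```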
Now that we’re done with the configuration and don’t need them anymore, we’ll clear the passwords from the environment.
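With the placeholder variable names from earlier, that is simply:

```
unset MYSQLROOT_PW SLURMDB_PW
```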
All the configuration is now done, and we are ready to start the services. On Ubuntu Trusty, services that are not started by Upstart get the same cgroup as the login session (which is the restrained “interactive” one in our case), so we should not start them by hand. Anyway, since we have made changes to the kernel command line, this is a good time to restart the machine.
If you later make minor changes to the configuration file, you can use the scontrol reconfig command to have the daemons reread slurm.conf.
Accounting
Now that the programs have been installed (and should be running), we’ll add some accounting features. We only have one cluster to manage, namely our own server:
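Assuming the cluster name I used in the slurm.conf sketch above:

```
sudo sacctmgr -i add cluster hpc
```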
We’ll then add “accounts”; accounts in SLURM parlance are really groupings of users into research groups and/or projects. Here is an example of how to add two such groups:
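With made-up group names:

```
sudo sacctmgr -i add account geophysics description="Geophysics group"
sudo sacctmgr -i add account reservoir  description="Reservoir modelling group"
```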
Finally, we need some users that can fill the accounts. The user names should be the system names that you log on with.
Notice that we first create the user and give the default account and partition. Each further association is set up by “creating” the user once more, but with a different account and/or partition.
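For a hypothetical user alice in the geophysics group, with the partition names from my sketch, that could look like:

```
sudo sacctmgr -i add user alice defaultaccount=geophysics partition=normal
sudo sacctmgr -i add user alice account=geophysics partition=express
sudo sacctmgr -i add user alice account=geophysics partition=standby
```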
Testing the installation
Use these commands to check that the cluster is live and ready to accept commands:
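For instance:

```
sinfo                  # the node should show up as "idle" in every partition
scontrol show node     # reported CPU count and memory should match the hardware
squeue                 # the queue itself, empty for now
sudo sacctmgr show assoc format=cluster,account,user,partition
```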
To test the installation, I’ll download a benchmark test, modify it slightly to run a little longer, and then submit it to a queue. Using htop, you can verify that it now runs with full CPU utilization.
We’ll submit a background workload to the queue with the least priority:
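A stand-in batch script; ./benchmark is whatever program you downloaded above, and the partition name is from my sketch:

```
cat > background.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=standby
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
srun ./benchmark
EOF
sbatch background.sh
```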
“Task” is the SLURM term for a process, and “cpu” is used for a core. An MPI job will typically have several tasks with one cpu per task, while an OpenMP job will have one task but use several cpus per task.
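In terms of submission options the contrast looks like this (the script names are placeholders):

```
sbatch --ntasks=8 --cpus-per-task=1 mpi_job.sh    # MPI: eight ranks, one core each
sbatch --ntasks=1 --cpus-per-task=8 omp_job.sh    # OpenMP: one process, eight cores
```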
We then submit a smaller job to a queue with higher priority, here shown using the PBS/TORQUE compatibility layer for illustration. Notice that the TORQUE wrapper does not set the PBS_NUM_PPN environment variable, so you will have to set the number of cores as a constant in the script:
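A sketch using the qsub wrapper from the slurm-llnl-torque package; the queue name and the program are placeholders, and the core count is hard-coded since PBS_NUM_PPN is unavailable:

```
cat > small.sh <<'EOF'
#!/bin/bash
#PBS -q express
#PBS -l nodes=1:ppn=4
NCORES=4                    # would normally come from $PBS_NUM_PPN
OMP_NUM_THREADS=$NCORES ./my_program
EOF
qsub small.sh
```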