Submitting Jobs to Lonestar6
Overview
The main point of using lonestar6 is that it is a massive computer cluster. If we run a command when logged into lonestar6, we are running it on one of the three low-memory, low-power "head" or "login" nodes at TACC. When we do serious computations that are going to take more than a few minutes or use a lot of RAM, we need to submit them to one of the 560 compute nodes (71,680 processor cores in total) on lonestar6.
In this section we are going to learn how to submit a job to the lonestar6 cluster.
Diagram of how a job gets run on lonestar6
Explanation
Start at the bottom - that's what you want: one of lonestar6's 560 compute nodes running your specific program (bowtie mapping in this case).
To get there, you must go through a "Queue Manager" program running on a different computer - the login (or "head") node. This program keeps track of what's running on those 560 nodes and what's in line to run next. It's very good at doing this.
You tell the Queue Manager what you want done via "bowtie.slurm" - your job submission script. That specifies how many nodes you need, what allocation to use, the maximum run time of the job, etc. The Queue Manager doesn't really care what you're running, just how you want it run. It needs to pass info on what you're running off to the compute node - you do that with the line setenv CONTROL_FILE commands.
The Queue Manager sends off the commands in the file commands to the compute nodes; so commands is really the first thing to start with.
The launcher_creator.py script just helps you by creating bowtie.slurm easily - saves you some time editing a file (and potentially messing it up).
Launcher
In the examples we tend to say that a job can be "interactive" or should be "submitted to the TACC queue". The first means that you can type it and run it directly. It should be short enough that it does not tie up the TACC head node. The second means that you should go through the launcher submission process described here.
If you do try to run a long job in interactive mode, it will be killed after 10-15 minutes and you may see a message like this:
|
Commands file
A launcher file tells lonestar6 which executables to run with your desired options and for how long. It requests a certain amount of resources (cores and time) so that lonestar6's scheduling program can figure out where to fit your job in.
All we need to do is create a text file. Each line in this text file, which we will call simply commands, is a command exactly as you would type it into the terminal yourself to have it run.
commands file
|
- The minimum number of processors that you can request on lonestar6 is 24, so you might as well add up to 22 more lines to this file that are different shell commands that will give some sort of output. Each will be run on a different core in parallel.
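As a minimal sketch, the following creates a commands file with three tasks and counts them. The three commands and their output file names are placeholders chosen for illustration only; they are not part of the course material.

```shell
# Write three independent commands, one per line, to a file named "commands".
# The quoted 'EOF' keeps the shell from expanding anything inside the file.
cat > commands <<'EOF'
echo "task 1 on $(hostname)" > task1.out
date > task2.out
ls -l > task3.out
EOF

# Each non-empty line is one task; count them before submitting.
grep -c . commands
```

Each line must be a complete command on its own, since each one is handed to a different core.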
Launcher script
Two ways to skin this cat
1. We have supplied a sample launcher script which you can modify to queue and execute your job. Here's how:
cp /work2/projects/BioITeam/projects/courses/rnaseq_course/day_1_partA/launcher.slurm .
Typically, we would change:
The -J line specifies the name of the job.
The -o line specifies the names of the output files that lonestar6 makes. We would change that to the name of this job.
The -t line specifies the length of time given to the job. The more time we request, the longer our job will wait in the queue before it runs. When the time is up, lonestar6 will terminate our job whether or not it's finished. So it's best to give our job slightly more time than we expect it to take.
We can also, optionally, add a few lines to have lonestar6 send an email to your email address when the job starts and finishes.
To do that, under -V, we would add 2 new lines like so:
|
Also, if we are part of multiple allocations, we'll need to specify which allocation to use (Case sensitive).
|
Lastly, we need to specify the job file.
|
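For orientation, here is a hedged sketch of what the header of a SLURM launcher script can look like with these changes made. It uses only standard sbatch directives. The job name, the email address, and the exact set of lines are assumptions for illustration, the allocation is the one used in the example command elsewhere on this page, and the sample launcher.slurm supplied for the course may differ.

```shell
#!/bin/bash
#SBATCH -J test                        # job name (hypothetical)
#SBATCH -o test.o%j                    # output file name; %j expands to the job ID
#SBATCH -t 01:00:00                    # maximum run time, hh:mm:ss
#SBATCH -A OTH21164                    # allocation to charge (case sensitive)
#SBATCH --mail-user=you@example.edu    # hypothetical email address
#SBATCH --mail-type=BEGIN,END          # send mail when the job starts and finishes
```

The line that names the job file is left out of this sketch because it depends on the launcher template being used.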
2. We can use launcher_creator.py to edit the launcher file without even opening it.
We have created a Python script called launcher_creator.py that makes creating a launcher.slurm script a breeze. You will probably want to use this for the rest of the course.
Now run the script with the -h option to show the help message:
launcher_creator.py -h
Option | Argument | Description |
---|---|---|
-n | name | The name of the job. |
-a | allocation | The allocation you want to charge the run to. |
-q | queue* | The queue to submit to, like 'normal' or 'development', etc. |
-w | wayness** | Optional. The number of jobs in a job list you want to give to each node. (Default is 24 for Lonestar6, 48 for stampede2.) |
-N | number of nodes | Optional. Specifies a certain number of nodes to use. You probably don't need this option, as the launcher calculates how many nodes you need based on the job list (or Bash command string) you submit. It sometimes comes in handy when writing pipelines. |
-t | time | Time allotment for job, format must be hh:mm:ss. |
-e | email | Optional. Your email address if you want to receive an email from lonestar6 when your job starts and ends. |
-l | launcher | Optional. Filename of the launcher. (Default is |
-m | modules | Optional. String of module management commands. |
-b | Bash commands | Optional. String of Bash commands to execute. |
-j | Command list | Optional. Filename of list of commands to be distributed to nodes. |
-s | stdout | Optional. Setting this flag outputs the name of the launcher to stdout. |
We should mention that launcher_creator.py does some under-the-hood magic for you and automatically calculates how many cores to request on lonestar6, assuming you want one core per process. You don't know it, but you should be grateful that this saves you from ever having to think about a confusing calculation.
*queue: There are several queue options. normal allows jobs to run for up to 48 hours; it is the slowest queue but the most often used. development allows jobs to run for 2 hours only; use it for testing, development, and learning. The largemem512GB and largemem1TB queues give access to the large-memory nodes, for jobs that need lots of memory (like assembly).
**wayness: If you want to allocate more memory per job, use wayness to reduce the number of jobs that get assigned to each node.
Example: If you have two jobs in your commands file and you do not use wayness, by default both jobs will be run on the same node, and each will be allocated only a fraction of that node's memory (the node's memory divided by the default wayness). If you use a wayness of 1, each job will be run on a different node and will have that node's full memory (256 GB on lonestar6) to itself.
**Wayness (commands/tasks per node)
Wayness sets how many commands/tasks are run on each compute node. By default, wayness will be 64 (equal to the number of physical cores per node on lonestar6). Each task will then get 1/64th of the memory available for one node (256 GB of memory) = 4 GB per task. Often, that is not enough memory per task or you may not even have 64 tasks in your commands file. Setting wayness to a number smaller than the default will allocate more memory per task.
tasks per node (wayness) | cores available to each task | memory available to each task |
---|---|---|
1 | 64 | 256 GB |
2 | 32 | 128 GB |
4 | 16 | 64 GB |
8 | 8 | 32 GB |
16 | 4 | 16 GB |
64 | 1 | 4 GB |
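The rows of the table are just the per-node totals divided by the wayness. The short sketch below reproduces that arithmetic, assuming the figures stated above (64 cores and 256 GB per node); the function name per_task is made up for this example.

```shell
# Per-node figures assumed from the table above.
CORES_PER_NODE=64
MEM_PER_NODE_GB=256

# Print the cores and memory each task gets for a given wayness ($1).
per_task() {
  echo "wayness=$1 cores=$((CORES_PER_NODE / $1)) mem_gb=$((MEM_PER_NODE_GB / $1))"
}

for w in 1 2 4 8 16 64; do
  per_task "$w"
done
```

For instance, per_task 4 prints wayness=4 cores=16 mem_gb=64, matching the third row of the table.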
launcher_creator.py -n test -t 01:00:00 -j commands -q development -a OTH21164
lonestar6 Queue
The next step is to submit the job to the queue using the launcher file:
sbatch launcher.slurm
lonestar6 will make sure that everything specified in the launcher file is correct and if it is, the job will be queued.
To check the status of the job, the command is:
showq -u
This will tell you its job priority and what state it is in.
If the job is in the list of "waiting jobs", this means the job has been queued and is waiting to start.
If the job is in the list of "active jobs", this means the job is running.
Field | Description |
---|---|
JOBID | job id assigned to the job |
USER | user that owns the job |
STATE | current job status, including, but not limited to: CD (completed), CA (cancelled), F (failed), PD (pending), R (running) |
In case we notice something wrong with the job, we can delete it like so:
scancel <job-ID>
To obtain the job-ID, look at the "showq" output.
You can make a job dependent on another job, so that it starts only after the first job has completed, using this command:
|
If you are part of a reservation (like for this class), your jobs will have higher priority in the queue. You will need to specify the reservation in this manner:
|
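The dependency and reservation submissions described above could look like the following sketch. It assumes the standard SLURM sbatch options --dependency and --reservation; the job ID and the reservation name are hypothetical placeholders, and the run helper is only there so the commands are printed rather than executed when tried away from the cluster.

```shell
# Print each command instead of executing it unless DRY_RUN=0 is set,
# so the sketch can be tried safely away from the cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

JOB_ID=123456            # hypothetical ID of the job that must finish first
RESERVATION=my_class     # hypothetical reservation name

# Start launcher.slurm only after job JOB_ID has completed successfully.
run sbatch --dependency=afterok:"$JOB_ID" launcher.slurm

# Submit launcher.slurm under a reservation.
run sbatch --reservation="$RESERVATION" launcher.slurm
```

Note that afterok waits for the first job to finish without error; SLURM also offers afterany, which starts the second job however the first one ended.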
TACC Output Files
While your job is running, TACC creates 3 different files with names based on the -o field in the launcher. These files are named like so:
|
These files have the output of your job that would have been sent to standard output or standard error and messages from TACC about your job. These files will be useful if your job fails.
An exception to submitting jobs: IDEV
Idev sessions are interactive sessions that you can start from a login node. In this case, you are requesting the queue to provide you with a certain amount of time on a compute node. Once the request goes through, you will be logged into a compute node on which you can directly run commands (like bwa, bowtie, etc.). You will not need to submit a job since you are already logged into a compute node.
Idev sessions are useful for short, quick analyses that you'd like to run or some development/testing that you'd like to do. Such processes could benefit from an interactive nature rather than from packaging into a big job. Idev sessions can be requested using the following command:
|
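As a rough sketch, an idev request is usually built from a queue and a session length. The example below assumes idev's -p (queue) and -m (minutes) options; the queue and the 60-minute length are illustrative choices, not the values used in the course. It only assembles and prints the command.

```shell
QUEUE=development   # queue to request the compute node from
MINUTES=60          # length of the interactive session, in minutes

IDEV_CMD="idev -p $QUEUE -m $MINUTES"
echo "$IDEV_CMD"    # run the printed command on a login node to start the session
```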
Now let's go try out all our new skills with a simple exercise...