Running a job on Lonestar
Job execution on Lonestar
Explanation
Start at the bottom, with your objective: One of Lonestar's 1,888 compute nodes running your specific program (bowtie mapping in this case).
To get there, you must go through a "Queue Manager" program running on a different computer - the login (or "head") node. This program keeps track of what's running on those 1,888 nodes and what's in line to run next.
You tell the Queue Manager what you want done via a file named something like "job.sge" - your job submission script. It specifies how many nodes you need, what allocation to use, the maximum run time of the job, etc. The Queue Manager doesn't really care what you're running, just how you want it run. It passes info on what you're running off to the compute node once it's established the rules for how you need to run it.
Batch computing
The main point of using Lonestar is that it is a massive computer cluster. Up to this point, we have been running all of our commands in "interactive mode", where we type a command and then wait for it to complete. We can only really run one command at a time this way. Furthermore, we've been using the "head" or "login" node on TACC when we do this. When we do serious computations that are going to take more than a few minutes or use a lot of RAM, we need to submit them to one of the other 1,888 compute nodes (containing 22,656 CPUs) on Lonestar.
In this section we are going to learn how to submit a job to the Lonestar cluster.
In the examples we tend to say that a job can be "interactive" or should be "submitted to the TACC queue". The first means that you can type it and run it directly. It should be short enough that it does not burden the TACC head node. The second means that you should go through the submission process described for batch jobs.
If you do try to run a long job in interactive mode, it will be killed after 10-15 minutes and you may see a message like this:
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ... Please do not run scripts or programs that require more than a few minutes of CPU time on the login nodes. Your current running process below has been killed and must be submitted to the queues, for usage policy see http://www.tacc.utexas.edu/user-services/usage-policies/ If you have any questions regarding this, please submit a consulting ticket.
A simple job script
A job script tells Lonestar which executables to run with your desired options and for how long. It requests a certain amount of resources (CPUs and time) so that the scheduling program can figure out where to fit your job in.
Start by editing a text file 'job.sge'
Now, add some control lines to the top of the job.sge file
(Feel free to copy and paste from here where appropriate)
Line | Function | Optional |
---|---|---|
#!/bin/bash | Tells the computer which shell to use | No |
#$ -V | Tells the job to inherit its 'environment' from your current shell session | No |
#$ -cwd | Tells the compute node to run the job out of the directory you submit from | Yes |
#$ -q development | Specifies which queue to submit to | No |
#$ -pe 1way 12 | Specifies how many CPUs to use | No |
#$ -N firstJob | The name of your job | Yes |
#$ -A 20121008-NGS-ACES | The name of the account to charge this job to | No |
#$ -l h_rt=00:05:00 | Tells the scheduler how long you estimate the job will take* | No |
Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub commands in a script file that will be reused several times rather than retyping the qsub commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script. See the Lonestar User Guide for more details.
* The -l line specifies the maximum run time given to the job. The more time we request, the longer our job will wait in the queue before it runs. When the time is up, Lonestar will terminate our job whether or not it has finished. So it's best to request slightly more time than we believe the job will take.
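As a rough illustration of "slightly more time", the sketch below pads an estimate and prints it in the HH:MM:SS form used by h_rt. The helper name `h_rt_with_margin` and the 20% margin are assumptions made for this example, not a TACC recommendation.

```shell
#!/bin/bash
# Hypothetical helper: convert an estimated run time in seconds into an
# HH:MM:SS string with a 20% safety margin (the margin is an assumption).
h_rt_with_margin() {
    local estimate_s=$1
    local padded_s=$(( estimate_s * 120 / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( padded_s / 3600 )) $(( padded_s % 3600 / 60 )) $(( padded_s % 60 ))
}

# A job expected to take about 4 minutes (240 seconds)
h_rt_with_margin 240   # prints 00:04:48
```

The printed value could then be placed after `h_rt=` on the -l line of the job script.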
After the control lines, add the commands you want the job to run:
date > date.out
ls > ls.out
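Putting the control lines and the two commands together, a complete job.sge might look like the sketch below. The values are the ones from the table; the explanatory comments are additions, placed on their own lines on the assumption that this is the safest way to annotate `#$` directives.

```shell
#!/bin/bash
# Inherit the environment of the submitting shell
#$ -V
# Run from the directory the job was submitted from
#$ -cwd
# Queue to submit to
#$ -q development
# Number of CPUs to use
#$ -pe 1way 12
# Job name
#$ -N firstJob
# Account to charge
#$ -A 20121008-NGS-ACES
# Maximum run time (HH:MM:SS)
#$ -l h_rt=00:05:00

# Commands to run on the compute node
date > date.out
ls > ls.out
```

Because the `#$` lines are ordinary comments to bash, the same file can also be run directly as a quick sanity check of the commands before it is submitted.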
Questions
These are all answerable by consulting the Lonestar User Guide.
Question 1: How long have we estimated it will take this job to run?
Question 2: How could we add email notification to this job?
Question 3: What is the maximum execution time for the development queue? For the normal queue?
Submitting to the queue
Now that we have our job file ready, submit it to the queue:
qsub job.sge
Lonestar will make sure that everything you've specified is correct and if it is, your job will be queued.
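If you want to keep the job-ID for later use, one option is to pick it out of the confirmation text that qsub prints. The sketch below assumes the usual Grid Engine message of the form `Your job <ID> ("<name>") has been submitted`; the helper name `extract_job_id` and the sample line are made up for illustration and are not real Lonestar output.

```shell
#!/bin/bash
# Hypothetical helper: pull the numeric job-ID out of qsub's confirmation text.
# The message format is an assumption based on typical Grid Engine output.
extract_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample confirmation line (made up for illustration)
sample='Your job 123456 ("firstJob") has been submitted'
job_id=$(printf '%s\n' "$sample" | extract_job_id)
echo "Submitted job-ID: $job_id"

# On the cluster one would pipe the real command instead:
#   job_id=$(qsub job.sge | extract_job_id)
```

Lines that do not match the pattern are dropped, so any banner text printed before the confirmation is ignored.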
You can check the status of your job like so:
qstat
This will tell you its job priority and what state it is in.
- A state of "qw" means "queued"
- A state of "r" means "running"
If you notice that your job was set up incorrectly, you can delete it like so:
qdel job-ID
(You can obtain the job-ID by typing "qstat" - it's the first column in the result)
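To pull the job-ID and state out of qstat's table without reading it by eye, a small filter like the one below may help. It assumes the common Grid Engine layout (two header lines, job-ID in column 1, state in column 5); the helper name `list_job_states` and the sample text are invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: print "job-ID state" pairs from qstat-style output.
# Assumes two header lines, job-ID in column 1 and state in column 5.
list_job_states() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1, $5 }'
}

# Sample text made up for illustration, not real Lonestar output
sample_qstat='job-ID  prior   name       user     state submit/start at     queue  slots ja-task-ID
-----------------------------------------------------------------------------------------
 123456 0.00000 firstJob   student1 qw    10/08/2012 09:16:00        12
 123457 0.55500 secondJob  student1 r     10/08/2012 09:10:00 normal@c301-101 12'

printf '%s\n' "$sample_qstat" | list_job_states
```

On the cluster, `qstat | list_job_states` would take the place of the sample text, and the first column of the result is the job-ID to hand to qdel.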
If you are nosy and want to see all of the jobs queued and running on Lonestar, then use this command:
showq
You can also see just your jobs in this format:
showq -u
Output Files
In addition to any files that result from your commands, while your job is running, the queue manager creates 4 additional files in your working directory. These files are named:
(job_name).e(job-ID)
(job_name).o(job-ID)
(job_name).pe(job-ID)
(job_name).po(job-ID)
They contain the output of your job that would have been sent to standard output (STDOUT) or standard error (STDERR), along with messages from the scheduler and execution host about your job. These files will be useful if your job fails.
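A quick way to look through all four files after a job finishes is sketched below. The helper name `job_output_files` and the job-ID 123456 are made up for illustration; substitute your own job name and the job-ID reported by the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: print the four scheduler output file names for a job.
job_output_files() {
    local job_name=$1 job_id=$2 suffix
    for suffix in e o pe po; do
        printf '%s.%s%s\n' "$job_name" "$suffix" "$job_id"
    done
}

# Show each file's contents, or note that it is missing or empty.
# The job-ID 123456 is invented for this example.
for f in $(job_output_files firstJob 123456); do
    if [ -s "$f" ]; then
        echo "== $f =="
        cat "$f"
    else
        echo "$f is missing or empty"
    fi
done
```

When a job fails, the .e file is usually the first place to look, since it holds whatever the job wrote to STDERR.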