Running a job on Lonestar

Job execution on Lonestar

Explanation

Start at the bottom, with your objective: One of Lonestar's 1,888 compute nodes running your specific program (bowtie mapping in this case).

To get there, you must go through a "Queue Manager" program running on a different computer - the login (or "head") node. This program keeps track of what's running on those 1,888 nodes and what's in line to run next.

You tell the Queue Manager what you want done via a file named something like "job.sge" - your job submission script. It specifies how many nodes you need, what allocation to use, the maximum run time of the job, etc. The Queue Manager doesn't really care what you're running, just how you want it run. It passes info on what you're running off to the compute node once it's established the rules for how you need to run it.

Batch computing

The main point of using Lonestar is that it is a massive computer cluster. Up to this point, we have been running all of our commands in "interactive mode", where we type a command and then wait for it to complete. We can only really run one command at a time this way. Furthermore, we've been using the "head" or "login" node on TACC when we do this. When we do serious computations that are going to take more than a few minutes or use a lot of RAM, we need to submit them to one of the other 1,888 compute nodes (containing 22,656 CPUs) on Lonestar.

In this section we are going to learn how to submit a job to the Lonestar cluster.

In the examples we tend to say that a job can be "interactive" or should be "submitted to the TACC queue". The first means that you can type it and run it directly. It should be short enough that it does not burden the TACC head node. The second means that you should go through the submission process described for batch jobs.

If you do try to run a long job in interactive mode, it will be killed after 10-15 minutes and you may see a message like this:

Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of
CPU time on the login nodes.  Your current running process below has been
killed and must be submitted to the queues, for usage policy see
http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.

A simple job script

A job script tells Lonestar which executables to run with your desired options and for how long. It requests a certain amount of resources (CPUs and time) so that the scheduling program can figure out where to fit your job in.

Start by editing a text file 'job.sge'

 Hint
nano job.sge

Now, add some control lines to the top of the job.sge file
(Feel free to copy and paste from here where appropriate)

Line                    | Function                                                                   | Optional
----------------------- | -------------------------------------------------------------------------- | --------
#!/bin/bash             | Tells the computer which shell to use                                      | No
#$ -V                   | Tells the job to inherit its 'environment' from your current shell session | No
#$ -cwd                 | Tells the compute node to run the job out of the directory you submit from | Yes
#$ -q development       | Specifies which queue to submit to                                         | No
#$ -pe 1way 12          | Specifies how many CPUs to use                                             | No
#$ -N firstJob          | The name of your job                                                       | Yes
#$ -A 20121008-NGS-ACES | The name of the account to charge this job to                              | No
#$ -l h_rt=00:05:00     | Tells the scheduler how long you estimate the job will take*               | No

Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable: options stored in a script file can be reused across runs instead of being retyped for every submission, and reusing the same script keeps the batch environment consistent from run to run. See the Lonestar User Guide for more details.

* The -l line sets the maximum run time given to the job. The more time we request, the longer the job is likely to wait in the queue before it runs. When the time is up, Lonestar will terminate the job whether or not it has finished. So, it's best to request slightly more time than we believe the job will take.

Add the actual command(s) to 'job.sge'
date > date.out
ls > ls.out
 The final product

Your file should very closely resemble the following:

job.sge
#!/bin/bash
#$ -V 
#$ -cwd
#$ -q development
#$ -pe 1way 12
#$ -N firstJob
#$ -A 20121008-NGS-ACES
#$ -l h_rt=00:05:00

date > date.out
ls > ls.out
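As an alternative to typing the file in nano, the same script can be written from the shell with a here-document and then checked. This is only a sketch: the file contents are copied from the listing above, and the grep check is an illustrative convenience rather than part of the Lonestar workflow.

```shell
# Write the job script in one step (this overwrites any existing job.sge).
# Quoting 'EOF' keeps the shell from expanding "$" inside the script.
cat > job.sge <<'EOF'
#!/bin/bash
#$ -V
#$ -cwd
#$ -q development
#$ -pe 1way 12
#$ -N firstJob
#$ -A 20121008-NGS-ACES
#$ -l h_rt=00:05:00

date > date.out
ls > ls.out
EOF

# Show only the scheduler control lines (those starting with "#$").
grep '^#\$' job.sge

# Count them; this script has 7 control lines.
grep -c '^#\$' job.sge
```

A missing or mistyped control line is easier to spot in this short listing than in the full file.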

Questions

These are all answerable by consulting the Lonestar User Guide.
Question 1: How long have we estimated it will take this job to run?

 Answer

#$ -l h_rt=00:05:00 is in hh:mm:ss format, which means we have requested 5 minutes of run time
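To make the hh:mm:ss arithmetic concrete, here is a minimal sketch that converts an h_rt value to seconds. The function name hrt_to_seconds is a hypothetical helper introduced for illustration; it is not a TACC or scheduler command.

```shell
# Convert an hh:mm:ss run-time limit to seconds.
# hrt_to_seconds is a hypothetical helper, not a TACC or SGE command.
# The body runs in a subshell "( ... )" so h, m and s stay local.
hrt_to_seconds() (
  IFS=: read -r h m s <<EOF
$1
EOF
  # Drop one leading zero so "08" or "09" is not parsed as invalid octal;
  # fall back to 0 if nothing is left.
  h=${h#0}; m=${m#0}; s=${s#0}
  echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

hrt_to_seconds 00:05:00   # prints 300
```

300 seconds is the 5 minutes requested in the job script.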

Question 2: How could we add email notification to this job?

 Answer
# Specify a valid email address
#$ -M vaughn@tacc.utexas.edu
# Indicate which events result in an email (begin,end,abort in this case)
#$ -m bea

Question 3: What is the maximum execution time for the development queue? For the normal queue?

 Answer

Queue       | Max Time | Max Processors
----------- | -------- | --------------
development | 1 h      | 264
normal      | 24 h     | 4104
serial      | 12 h     | 12
largemem    | 24 h     | 48
gpu         | 24 h     | 48
vis         | 24 h     | 48
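The time limits in this answer can be turned into a quick check before submitting. The sketch below assumes the limits listed here are still current (consult the Lonestar User Guide to confirm); queue_max_minutes and fits_queue are hypothetical helper names, not scheduler commands.

```shell
# Maximum run time per queue in minutes, copied from the limits listed above.
queue_max_minutes() {
  case "$1" in
    development)             echo 60 ;;
    serial)                  echo 720 ;;
    normal|largemem|gpu|vis) echo 1440 ;;
    *) echo "unknown queue: $1" >&2; return 1 ;;
  esac
}

# Succeeds if a request of $2 minutes fits within the limit of queue $1.
fits_queue() {
  max=$(queue_max_minutes "$1") || return 1
  [ "$2" -le "$max" ]
}

if fits_queue development 5; then
  echo "5 minutes fits in the development queue"
fi
```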

Submitting to the queue

Now that we have our job file ready, submit it to the queue:

  qsub job.sge

Lonestar will make sure that everything you've specified is correct and if it is, your job will be queued.

You can check the status of your job like so:

  qstat

This will tell you its job priority and what state it is in.

  • A state of "qw" means "queued"
  • A state of "r" means "running"
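If you check job states from a script, the two codes above can be mapped to readable labels. This is a minimal sketch: describe_state is a hypothetical helper, and it only knows the two states named here, reporting anything else as "other".

```shell
# Translate a qstat state code into a readable label.
# Only "qw" and "r" are covered; other codes are passed through as "other".
describe_state() {
  case "$1" in
    qw) echo "queued" ;;
    r)  echo "running" ;;
    *)  echo "other ($1)" ;;
  esac
}

describe_state qw   # prints queued
describe_state r    # prints running
```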

If you notice a mistake in a job you have already submitted, you can delete the job like so:

  qdel job-ID

(You can obtain the job-ID by typing "qstat" - it's the first column in the result)

If you are nosy and want to see all of the jobs queued and running on Lonestar, then use this command:

  showq

You can also see just your jobs in this format:

  showq -u

Output Files

In addition to any files that result from your commands, while your job is running, the queue manager creates 4 additional files in your working directory. These files are named:

 (job_name).e(job-ID)
 (job_name).o(job-ID)
 (job_name).pe(job-ID)
 (job_name).po(job-ID)

They contain the output of your job that would have been sent to standard output (STDOUT) or standard error (STDERR) and messages from the scheduler and execution host about your job. These files will be useful if your job fails.
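When a job fails, it helps to look at all four files at once. The sketch below builds the file names from the pattern above and prints any that are non-empty. The job ID 123456 is a made-up placeholder (use the one reported by qstat), and output_files is a hypothetical helper name.

```shell
job_name="firstJob"   # the name given on the "#$ -N" line
job_id="123456"       # placeholder; replace with your real job-ID

# Print the four file names the queue manager uses for a job.
output_files() {
  for suffix in e o pe po; do
    echo "$1.$suffix$2"
  done
}

# Show the contents of each file that exists and is not empty.
for f in $(output_files "$job_name" "$job_id"); do
  if [ -s "$f" ]; then
    echo "== $f =="
    cat "$f"
  fi
done
```

Run it from the directory you submitted the job from, since that is where the files are written.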