...

Code Block
languagebash
titleCreate batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q development

You should see output something like the following, and you should see a simple.slurm batch submission file in the current directory.

Code Block
Project simple.
Using job file simple.cmds.
Using normal queue.
For 00:01:00 time.
Using OTH21164 allocation.
Not sending start/stop email.
Launcher successfully created. Type "sbatch simple.slurm" to queue your job.

Submit your batch job (with or without the --reservation), then check the batch queue to see the job's status.

Code Block
languagebash
titleSubmit simple job to batch queue
sbatch --reservation CoreNGS-Tue simple.slurm  # or:
sbatch simple.slurm
showq -u

# Output looks something like this:
-------------------------------------------------------------
          Welcome to the Lonestar6 Supercomputer
-------------------------------------------------------------
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Checking available allocation (OTH21164)...OK
Submitted batch job 232542

...

This technique is sometimes called filename globbing, and the pattern a glob. Don't ask me why; it's a Unix thing. Globbing, which means translating a glob pattern into a list of files, is one of the handy things the bash shell does for you. (Read more about Pathname wildcards)

Exercise: How would you list all files starting with "simple"?

Expand
titleAnswer
ls simple*
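
To see globbing at work without touching real files, here is a minimal sketch that runs in a scratch directory. The file names are hypothetical and chosen only for illustration.

```shell
# Work in a scratch directory so no real files are touched
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical file names, chosen only for illustration
touch simple.cmds simple.slurm simple.o924965 wayness.cmds

# The shell expands simple* into the matching names before ls runs
ls simple*

# echo shows the expansion directly: the glob becomes a list of arguments
echo simple*

# Capture the expanded list for later use
matches=$(echo simple*)
```

Note that wayness.cmds is not listed, because it does not start with "simple".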

...

Every command and Unix program has three "built-in" streams: standard input, standard output and standard error, each with a name, a number, and a redirection syntax.


Normally echo writes its string to standard output. If you invoke echo in an interactive shell like Terminal, standard output is displayed in the Terminal window. But echo could also encounter an error and write an error message to standard error. We want both standard output and standard error for each task stored in a log file named for the command number.

So in the above example the first '>' says to redirect the standard output of the echo command to the cmd3.log file. The '2>&1' part says to redirect standard error to the same place. Technically, it says to redirect standard error (built-in Linux stream 2) to the same place as standard output (built-in Linux stream 1); and since standard output is going to cmd3.log, any standard error will go there also. (Read more about Standard streams and redirection)
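
The same redirection can be tried on a small stand-in task. This is only a sketch: noisy_task is a hypothetical function that writes one line to each stream, and the log is written to a scratch directory.

```shell
log_dir=$(mktemp -d)

# Hypothetical task that writes one line to each stream
noisy_task() {
    echo "result line"        # goes to standard output (stream 1)
    echo "error line" >&2     # goes to standard error (stream 2)
}

# Send standard output to the log, then point standard error at the same place
noisy_task > "$log_dir/cmd3.log" 2>&1

# Both lines end up in the one log file
cat "$log_dir/cmd3.log"
```

The order of the two redirections matters: writing 2>&1 before the '>' would point standard error at wherever standard output was going at that moment (usually the Terminal), not at the log file.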

When the TACC batch system runs a job, all outputs generated by tasks in the batch job are directed to one output file and one error file per job. Here they have names like simple.o924965 and simple.e924965. simple.o924965 contains all standard output and simple.e924965 contains all standard error generated by your tasks that was not redirected elsewhere, as well as information relating to running your job and its tasks. For large jobs with complex tasks, it is not easy to troubleshoot execution problems using these files.

So a best practice is to separate the outputs of all our tasks into individual log files, one per task, as we do here. Why is this important? Suppose we run a job with 100 commands, each one a whole pipeline (alignment, for example). 88 finish fine but 12 do not. Just try figuring out which ones had the errors, and where the errors occurred, if all the standard output is in one intermingled file and all standard error in the other intermingled file!
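
As a rough illustration of why per-task logs help, the sketch below builds a small commands file in a scratch directory, runs its lines one after another as a stand-in for the launcher, and then asks which logs lack the success message. The file name demo.cmds and the task contents are hypothetical, and the real launcher runs tasks on compute nodes rather than in a serial loop.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Write a hypothetical commands file: one task per line, each with its own log
for i in 1 2 3 4; do
    echo "echo \"task $i done\" > cmd$i.log 2>&1"
done > demo.cmds

# Add one task that fails, to see where its error message lands
echo "ls /no/such/path > cmd5.log 2>&1" >> demo.cmds

# Stand-in for the launcher: run each line serially
while IFS= read -r task; do
    sh -c "$task" < /dev/null
done < demo.cmds

# List the logs that do not contain the success message
failed_logs=$(grep -L "done" cmd*.log)
echo "Logs without a success message: $failed_logs"
```

Because each task wrote to its own file, the failing task is identified by its log name, and its error message sits in that log alone.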

Job parameters

Now that we've executed a really simple job, let's take a look at some important job submission parameters. These correspond to arguments to the launcher_creator.py script.

A bit of background. Historically, TACC was set up to cater to researchers writing their own C or Fortran codes highly optimized to exploit parallelism (the HPC crowd). Much of TACC's documentation is aimed at this audience, which makes it difficult to pick out the important parts for us.

...

The launcher module knows how to interpret various job parameters in the <job_name>.slurm batch SLURM submission script and use them to create your job and assign its tasks to compute nodes. Our launcher_creator.py program is a simple Python script that lets you specify job parameters and writes out a valid <job_name>.slurm submission script.

...

Code Block
titlelauncher_creator.py usage
usage: launcher_creator.py [-h] -n NAME -t TIME_REQUEST [-j JOB_FILE]
                           [-b SHELL_COMMANDS] [-B SHELL_COMMANDS_FILE]
                           [-q QUEUE] [-a [ALLOCATION]] [-m MODULES]
                           [-M MODULES_FILE] [-w WAYNESS] [-N NUM_NODES]
                           [-e [EMAIL]] [-l LAUNCHER] [-s]

Create launchers for TACC clusters. Report problems to rt-
other@ccbb.utexas.edu

optional arguments:
  -h, --help            show this help message and exit

Required:
  -n NAME, --name NAME  The name of your job.
  -t TIME_REQUEST, --time TIME_REQUEST
                        The time you want to give to your job. Format:
                        hh:mm:ss

Commands:
  You must use at least one of these options to submit your commands for   TACC.

  -j JOB_FILE, --jobs JOB_FILE
                        The name of the job file containing your commands.
  -b SHELL_COMMANDS, --bash SHELL_COMMANDS
                        A string of shell (Bash, zsh, etc) commands that are
                        executed before any parametric jobs are launched.
  -B SHELL_COMMANDS_FILE, --bash_file SHELL_COMMANDS_FILE
                        A file containing shell (Bash, zsh, etc) commands that
                        are executed before any parametric jobs are launched.

Optional:
  -q QUEUE, --queue QUEUE
                        The TACC allocation for job submission.
                        Default="development"
  -a [ALLOCATION], -A [ALLOCATION], --allocation [ALLOCATION]
                        The TACC allocation for job submission. You can set a
                        default ALLOCATION environment variable.
  -m MODULES, --modules MODULES
                        A list of module commands. The "launcher" module is
                        always automatically included. Example: -m "module
                        swap intel gcc; module load bedtools"
  -M MODULES_FILE, --modules_file MODULES_FILE
                        A file containing module commands.
  -w WAYNESS, --wayness WAYNESS
                        Wayness: the number of commands you want to give each
                        node. The default is the number of cores per node.
  -N NUM_NODES, --num_nodes NUM_NODES
                        Number of nodes to request. You probably don't need
                        this option. Use wayness instead. You ONLY need it if
                        you want to run a job list that isn't defined at the
                        time you submit the launcher.
  -e [EMAIL], --email [EMAIL]
                        Your email address if you want to receive an email
                        from Lonestar when your job starts and ends. Without
                        an argument, it will use a default EMAIL_ADDRESS
                        environment variable.
  -l LAUNCHER, --launcher_name LAUNCHER
                        The name of the launcher script that will be created.
                        Default="<name>.slurm"
  -s                    Echoes the launcher filename to stdout.

...

Code Block
languagebash
titleCreate batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q development
  • The name of your commands file is given with the -j simple.cmds option.
  • Your desired job name is given with the -n simple option.
    • The <job_name> (here simple) is the job name you will see in your queue.
    • By default a corresponding <job_name>.slurm batch file is created for you.
      • It contains the name of the commands file that the batch system will execute.

...

  • In launcher_creator.py, the queue is specified by the -q argument.
    • The default queue is development. Specify -q normal for normal queue jobs.
  • The maximum runtime you are requesting for your job is specified by the -t argument.
    • Format is hh:mm:ss
    • Note that your job will be terminated without warning at the end of its time limit!
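
If you think of run times in minutes, a small helper can produce the hh:mm:ss string that -t expects. This is a sketch only: to_hms is a hypothetical name and is not part of launcher_creator.py.

```shell
# Convert a whole number of minutes to hh:mm:ss (hypothetical helper)
to_hms() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# 90 minutes expressed as a time request
time_request=$(to_hms 90)
echo "$time_request"
```

The result can then be passed as the -t value when creating the launcher. It is worth padding the estimate, since the job is terminated when the limit is reached.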

...

You may be a member of several different projects, and so have a choice of which resource allocation to run your job under.

  • You specify that allocation name with the -a argument of launcher_creator.py.
  • If you have set an $ALLOCATION environment variable to an allocation name, that allocation will be used if you are a member of only one TACC project.
Expand
titleOur class ALLOCATION was set in .bashrc

The .bashrc login script you've installed for this course specifies the class's allocation as shown below. Note that this allocation will expire after the course, so you should change that setting appropriately at some point.

Code Block
languagebash
titleALLOCATION setting in .bashrc
# This sets the default project allocation for launcher_creator.py
export ALLOCATION=OTH21164


  • When you run a batch job, your project allocation gets "charged" for the time your job runs, in the currency of SUs (System Units).
  • SUs are related to node hours; usually 1 SU = 1 node hour.

Tip
titleJob tasks should have similar expected runtimes

Jobs should consist of tasks that will run for approximately the same length of time. This is because the total node hours for your job is calculated as the run time for your longest running task (the one that finishes last).

For example, if you specify 100 commands and 99 finish in 2 seconds but one runs for 24 hours, you'll be charged for 100 x 24 node hours even though the total amount of work performed was only ~24 hours.

...

One of the most confusing things in job submission is the parameter called wayness, which controls how many tasks are run on each compute node.

  • Recall that there are 128 physical cores and 256 GB of memory on each compute node
    • so theoretically you could run up to 128 commands on a node, each with ~2 GB available memory
    • you usually run fewer tasks on a node, and when you do, each task gets more resources
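
The trade-off can be worked through with a little shell arithmetic. The sketch below assumes 256 GB of memory per node, that memory is shared evenly among the tasks on a node, and that the number of nodes is the task count divided by wayness, rounded up. The task count and wayness values are hypothetical.

```shell
# Assumed node size, taken from the Lonestar6 figure of 256 GB per node
mem_per_node_gb=256

# Hypothetical job: 100 tasks, 8 tasks per node
num_tasks=100
wayness=8

# Nodes needed: tasks divided by wayness, rounded up (integer arithmetic)
nodes_needed=$(( (num_tasks + wayness - 1) / wayness ))

# Approximate memory share per task when a node runs "wayness" tasks
mem_per_task_gb=$(( mem_per_node_gb / wayness ))

echo "nodes: $nodes_needed, memory per task: ~$mem_per_task_gb GB"
```

Lowering wayness gives each task more memory but spreads the job over more nodes.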

...

Expand
titleAnswer

Find the number of lines in the wayness.cmds commands file using the wc (word count) command with the -l (lines) option:

Code Block
languagebash
titleCount lines in the wayness.cmds file
wc -l wayness.cmds

The file has 16 lines, representing 16 tasks.

...

Code Block
languagebash
titleCreate batch submission script for wayness example
launcher_creator.py -j wayness.cmds -n wayness -w 4 -t 00:02:00 -a OTH21164 -q development
sbatch --reservation=CoreNGSday2 wayness.slurm
showq -u

Exercise: With 16 tasks requested and wayness of 4, how many nodes will this job require? How much memory will be available for each task?

...