Introduction

...

Getting started with Bosco

The Tier-3 uses utatlas.its.utexas.edu as a submission host - this is where the Condor scheduler lives.

Bosco is a job submission manager designed to handle submissions across different computing resources.  It is needed to submit jobs from our workstations to the Tier-3.

Make sure you have an account on our local machine utatlas.its.utexas.edu, and that you have passwordless ssh set up to it from the tau* machines.

To do this, create an RSA key and copy your .ssh folder onto the tau machine using scp.
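
For example, one minimal way to do this (a sketch only; run the commands on utatlas.its.utexas.edu, and replace username and the tau hostname with your own) is:

Code Block
bash
# create an RSA key pair (accept the defaults; leave the passphrase empty for passwordless ssh)
ssh-keygen -t rsa
# authorize the new key on this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# copy the whole .ssh folder over to the tau* workstation you use
scp -r ~/.ssh username@<tau-machine>:~/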

Then carry out the following instructions on any of the tau* workstations:

Code Block
bash
cd ~
# download and unpack the Bosco quickstart script
curl -o bosco_quickstart.tar.gz ftp://ftp.cs.wisc.edu/condor/bosco/1.2/bosco_quickstart.tar.gz
tar xvzf ./bosco_quickstart.tar.gz
# run the quickstart script to install and configure Bosco
./bosco_quickstart
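
Once the quickstart has finished, a typical working session looks roughly like the following (a sketch only, assuming the default ~/bosco install location and a submission file named myjob.sub):

Code Block
bash
source ~/bosco/bosco_setenv   # put the Bosco/Condor commands on your PATH
bosco_start                   # start the local Bosco daemons if they are not already running
condor_submit myjob.sub       # submit a job through Bosco
condor_q                      # check on its status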

...

Lastly, here is a more detailed guide to Bosco.

Tier-3 Tips

Here are some useful tips for running with the Cloud Tier-3:

  • The worker nodes do not mount any of our network disks. This is partly for simplicity and robustness, and partly for security reasons.  Because of this, your job must transfer files either using the Condor file transfer mechanism (recommended for code and output data) or using the xrootd door on utatlas.its.utexas.edu, which gives read access to /data through the URL root://utatlas.its.utexas.edu://data/... (recommended for input datasets).  Although this may seem unfortunate, it is actually a benefit: any submitted job that runs properly on the Cloud Tier-3 can be flocked to other sites, which obviously don't mount our filesystem, without being rewritten (see Overflowing to ATLAS Connect below).  
  • You must submit jobs in the "grid" universe (again, to enable proper flocking).  In other words, put

    Code Block
    grid_resource = batch condor ${USER}@utatlas.its.utexas.edu

    in your Condor submission file (replace ${USER} with your username).  An example submission file putting these pieces together is sketched after this list.

  • The worker nodes have full ATLAS CVMFS.
  • One common problem is having jobs go into the Held state with no logfiles or other explanation of what is going on. Running condor_q -long <jobid> will report "Job held remotely for no reason."  By far the most common cause of this (>99.9% of cases) is that a file requested to be transferred back through the file transfer mechanism in the submission file is not produced by the job, usually because the job failed (unable to read input data, crash of the code, etc.).  Since you won't have the logfile, the easiest way to debug this is to resubmit the job without the output file transfer specified in the submission script. (This is a very unfortunate and nasty feature of Bosco.)
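
Putting the tips above together, a complete submission file might look roughly like the sketch below. This is an illustration only, not an official template: run_analysis.sh, analysis.tar.gz and output.root are placeholder names, and ${USER} should again be replaced with your username.

Code Block
# submit in the grid universe so the job can flock properly
universe                = grid
grid_resource           = batch condor ${USER}@utatlas.its.utexas.edu

# ship the code with the Condor file transfer mechanism, since the workers
# do not mount our network disks; read input datasets over xrootd inside
# run_analysis.sh via root://utatlas.its.utexas.edu://data/...
executable              = run_analysis.sh
transfer_input_files    = analysis.tar.gz
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# only list outputs the job is guaranteed to produce; a missing file here is
# the usual cause of the "Job held remotely for no reason" state
transfer_output_files   = output.root

output = job.$(Cluster).$(Process).out
error  = job.$(Cluster).$(Process).err
log    = job.$(Cluster).$(Process).log

queue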

Overflowing to ATLAS Connect

The Cloud Tier-3 is enabled to overflow user jobs to the ATLAS Connect system when not enough slots are available locally. ATLAS Connect allows multiple sites (Chicago, Illinois, Indiana, Fresno State, and us) to "flock" jobs to each other when necessary, achieving better CPU utilization and job throughput.

Due to the "disconnected" nature of our worker nodes (no mounting of our filesystems), essentially all jobs can flock from our Tier-3 to other sites. In fact the default behavior is to flock the jobs if necessary, and this is usually completely transparent. If you need to forbid jobs from flocking (pinning them to our systems), you can do this by adding the following to your Condor submission script: 

Code Block
Requirements = ( IS_RCC =?= undefined )

VM configuration

Our virtual machines are CentOS 6 instances configured with CVMFS for importing the ATLAS software stack from CERN. They also boot individual instances of the Condor job scheduling system. They access the same instance of the Squid HTTP caching server that our local workstations use (on utatlas.its.utexas.edu), which helps reduce the network traffic required for CVMFS and for database access through the Frontier system.
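
For example (a sketch only; the exact releases and tools depend on your analysis), a job running on one of these nodes can set up ATLAS software straight from CVMFS with the standard ATLASLocalRootBase setup:

Code Block
bash
# standard ATLAS CVMFS-based environment setup
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
lsetup root    # e.g. pick up a ROOT build from CVMFS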

...

Code Block
bash
# log in to the FutureGrid machine (replace username with your own)
ssh username@alamo.futuregrid.org

Then visit the list of instances to see which nodes are running, and simply

Code Block
bash
# ssh to the running node's internal IP address (taken from the instance list)
ssh root@10.XXX.X.XX

and you are now accessing a node!