...

Code Block
languagebash
titlelog into stampede2 with the ssh command
collapsetrue
ssh <username>@stampede2.tacc.utexas.edu

...

Now that we have backed up your profiles so you won't lose any previous settings, you can copy our predefined GVA2022.bashrc file from the /corral-repl/utexas/BioITeam/gva_course/ folder to your $HOME folder as .bashrc, and the predefined GVA2022.profile from the same location as .profile. Then use the chmod command to change the permissions to read, write, and execute for the user only.

Code Block
languagebash
titleCopy the course provided .bashrc and .profile files and change their names and permissions
collapsetrue
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.bashrc .bashrc
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.profile .profile
chmod 700 .bashrc
chmod 700 .profile
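
To see what the three digits given to chmod mean, the following minimal sketch applies two different modes to a throwaway file and prints the resulting permission string. The temporary directory and the file name .bashrc_example are hypothetical stand-ins, so nothing in your real $HOME is touched.

```shell
# Work in a throwaway directory; .bashrc_example is a hypothetical stand-in file.
perm_demo=$(mktemp -d)
touch "$perm_demo/.bashrc_example"

# 700 = 7 (read+write+execute) for the user, 0 for the group, 0 for everyone else.
chmod 700 "$perm_demo/.bashrc_example"
mode_700=$(ls -l "$perm_demo/.bashrc_example" | cut -c1-10)
echo "$mode_700"

# 600 = 6 (read+write, no execute) for the user only, shown for comparison.
chmod 600 "$perm_demo/.bashrc_example"
mode_600=$(ls -l "$perm_demo/.bashrc_example" | cut -c1-10)
echo "$mode_600"

rm -r "$perm_demo"
```

The first ten characters of ls -l output are the file type followed by the user, group, and other permissions, which is why the sketch keeps only those columns.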

...

Info
titleUnderstanding why some files start with a "."

In the above code box, you see that the file names start with a ".". When a filename starts with a ".", it conveys a special meaning to the operating system/command line. Specifically, it prevents that file from being displayed when you use the ls command unless you specifically ask for hidden files to be displayed using the -a option. Such files are termed "dot-files" if you are interested in researching them further.

Let's look at a few different ways we will use the ls command throughout the course. Compare the output of the following 4 commands:

Code Block
languagebash
titleStandard output
ls              # ignore everything that comes after the # mark. There is a problem on this wiki page, but things after a # won't affect commands


Code Block
languagebash
titleStandard output plus hidden files
ls -a


Code Block
languagebash
titleStandard output plus hidden files in a single column
ls -a -1


Code Block
languagebash
titleStandard output plus hidden files in a single column with additional information
ls -a -l

Throughout the course you will notice that many options are supplied to commands via a single dash immediately followed by a single letter. Usually when you have multiple options supplied in this manner you can combine all the letters after a single dash to make things easier/faster to type. Experiment a little to prove to yourself that the following 2 commands give the same output.

Code Block
languagebash
titleTwo equivalent ways to list hidden files with additional information
ls -a -l

ls -al

While knowing that you can combine options in this way helps you analyze data faster/better, the real value comes from being able to decipher commands you come across on help forums, or in publications.

For ls specifically the following association table is worth making note of, but if you want the 'official' names consider using the man command to bring up the ls manual.

flag    association
-a      "all" files
-l      "long" listing of file information
-1      1 column
-h      human readable
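
The flags in the table can be tried safely in a throwaway directory. The sketch below is only an illustration: the directory comes from mktemp, and the file names visible.txt and .hidden_file are made up for the demo.

```shell
# Build a throwaway directory so the demo does not depend on what is in $HOME.
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden_file"

plain=$(ls "$demo_dir")              # the hidden file is not listed
with_hidden=$(ls -a "$demo_dir")     # adds ., .. and .hidden_file
separate=$(ls -a -l "$demo_dir")     # options given one at a time
combined=$(ls -al "$demo_dir")       # the same options behind a single dash

echo "$plain"
echo "$with_hidden"
if [ "$separate" = "$combined" ]; then
    echo "ls -a -l and ls -al give the same output"
fi

ls -alh "$demo_dir"                  # -h prints sizes in human readable units
rm -r "$demo_dir"
```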



Getting back to your profile... Since .bashrc is executed when you log in, to ensure it is set up properly you should first log out:

Code Block
languagebash
titleHow to leave stampede2 by logout or exit from a remote connection
collapsetrue
logout
# or
exit

...

Code Block
languagebash
titleLog back in to stampede2
collapsetrue
ssh <username>@stampede2.tacc.utexas.edu

...

Code Block
titleCreating a shortcut to the main Stampede2 working directories
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam

In previous years, several people have reported seeing an error message stating "ln: failed to create symbolic link 'BioITeam/BioITeam': Permission denied." This seems to be related to different project allocations. I do not think it will be an issue for anyone this year.
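
If symbolic links are new to you, the following sketch shows what an ln -s shortcut does, using a temporary sandbox instead of the real TACC file systems. FAKE_SCRATCH and the directory names are hypothetical stand-ins for $SCRATCH and your home directory.

```shell
sandbox=$(mktemp -d)
FAKE_SCRATCH="$sandbox/far/away/scratch_area"   # hypothetical stand-in for $SCRATCH
mkdir -p "$FAKE_SCRATCH" "$sandbox/home"

cd "$sandbox/home"
ln -s "$FAKE_SCRATCH" scratch        # same form as: ln -s $SCRATCH scratch
link_target=$(readlink scratch)      # where the shortcut points
echo "scratch -> $link_target"

# Files created through the shortcut really live in the target directory.
touch scratch/example.txt
ls "$FAKE_SCRATCH"
```

Note that when the link name already exists and points to a directory, ln places the new link inside that directory. That is one way a path such as 'BioITeam/BioITeam' can show up in an error message, although whether it explains the reports above is only an assumption.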

  • Understanding what your .bashrc file actually does.

Expand
titleWhile interesting and useful information to have, understanding it is not critical to variant analysis. I suggest you look through this information after you complete the rest of the tutorial, in your free time, or when you need to modify your profile or bashrc files in the future.


This page may be useful if you want to further customize your prompt after the course.

Info

Let's look at what your .bashrc profile actually does. Use the cat command to print contents of the .bashrc file to the screen.

Code Block
languagebash
titlePrint the contents of the .bashrc file to the screen
cat .bashrc

This will print several lines of text to the terminal window. Let's look at what some of these lines do with a little more information:

  • lines that start with #

    • Any line that begins with a # symbol is "commented out". Anything after a # symbol will not be executed by any program. Programmers commonly make use of this behavior to leave notes for others, or even themselves at a later date, as to what particular lines of a script are actually doing.
  • Section 1 has multiple lines involving "module load <NAME>"

    • This loads different modules by default. We have included basic ones that will help with basic TACC things. After we review the use of the nano text editor we'll go into more depth with TACC modules. But for now trust us when we say that not having to load a bunch of modules every time you log into TACC is a good thing.

    • In previous years the module system was used more extensively. We now rely more on miniconda installations for increased portability. If you find yourself working within TACC (or equivalent resources), the module system (or similar systems) can be very advantageous.
  • Section 2 has multiple lines starting with "export"

    • The export lines define shell variables, for example BI and PATH. You've already seen how using $BI can come in handy accessing our shared course directory. As for PATH, that is a well-known environment variable that defines a set of directories where the shell will look when you type in a program's name. Our shared profile adds the common course directories that we copied at the start of this tutorial and your local ~/local/bin directory (which does not exist yet) to the location list. You can see the entire list of locations by doing this:

      Code Block
      languagebash
      titleHow to see where the bash shell looks for programs
      echo $PATH

      As you can see, there are a lot of locations on the path. That's because when you load modules at TACC (see above), that mechanism makes the programs available to you by putting their installation directories on your $PATH.
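
To make the role of PATH concrete, here is a minimal sketch that creates a tiny program in a temporary directory, adds that directory to PATH, and then runs the program by name only. The directory and the program name my_demo_tool are hypothetical; they stand in for something like ~/local/bin and a tool you installed there.

```shell
tool_dir=$(mktemp -d)                     # hypothetical stand-in for ~/local/bin
printf '#!/bin/sh\necho "hello from my_demo_tool"\n' > "$tool_dir/my_demo_tool"
chmod 700 "$tool_dir/my_demo_tool"

# Prepend the directory, in the same spirit as the export lines in the profile.
export PATH="$tool_dir:$PATH"

found_at=$(command -v my_demo_tool)       # where the shell found the program
tool_output=$(my_demo_tool)               # run it by name only, no path needed
echo "$found_at"
echo "$tool_output"

# One directory per line is easier to read than the raw colon-separated list.
echo "$PATH" | tr ':' '\n' | head -n 3
```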

  • umask 002

    • The umask command sets the default permissions of newly created files and directories, limiting the need to use the chmod command. umask functions as the inverse of chmod, meaning that it subtracts its value from the default permissions (777 for directories and 666 for files). In this case the command umask 002 is the equivalent of the command chmod 775 for directories, and chmod 664 for files. In summary, having this command in your .profile gives you and your group read and write access to all new files you create, while giving read only access to everyone else.
  • PS1='tacc:\w$ '

    • The PS1='tacc:\w$ ' line is a special setting that tells the shell to display the current directory as part of its prompt. It saves you typing pwd all the time to see where you are in the directory hierarchy. Try using the mkdir command to make a new directory called tmp and change into that directory to see what it does to your prompt.

Code Block
languagebash
titleSee how your prompt reflects your current directory
collapsetrue
mkdir tmp
cd tmp


  • Your prompt should have changed from "tacc:~$" to now be "tacc:~/tmp$". Your prompt now tells you that you are in the tmp subdirectory of your home directory (~). See if you can figure out how to return to your home directory without expanding the code block. Expand the following code block to see the different ways of returning to your home directory.

    Code Block
    languagebash
    titleHow to return to your home directory
    collapsetrue
    cd
    cdh
    cd $HOME
    cd ~
    cd -

    The last example in the above code block will return you to your previous directory. In this case, that means the home directory, but it can be very useful in other situations when you change directories to do something in 1 place then need to hop back to where you were, or if you mistakenly leave a directory.
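
The hop-back behavior of cd - can be sketched with two temporary directories standing in for places you might be working (both are hypothetical and created only for this demo):

```shell
first_dir=$(mktemp -d)     # hypothetical stand-ins for two project directories
second_dir=$(mktemp -d)

cd "$first_dir"
cd "$second_dir"           # go do something somewhere else
cd - > /dev/null           # hop back; cd - also prints where it lands
after_first_hop=$(pwd)
cd - > /dev/null           # a second cd - toggles back again
after_second_hop=$(pwd)

echo "$after_first_hop"
echo "$after_second_hop"
```

Because cd - only remembers the single previous directory, repeating it simply alternates between the same two locations.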


  • ...

    • Linux text editors installed at TACC (nano, vi, emacs). These run in your terminal window. vi and emacs are extremely powerful but also quite complex, so nano is the best choice as a first local text editor. It is also powerful enough that you can still accomplish whatever you are working on; it just might be more difficult if you try to do more complex edits. If you are already familiar with one of the other programs you are welcome to continue using it.
    • Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer (Notepad++ or Komodo Edit). Once you connect to the remote host, you can navigate its directory structure and edit files. When you open a file, its contents are brought over the network into the text editor's edit window, then saved back when you save the file.
    • Software that will allow you to mount your home directory on TACC as if it were a normal disk e.g. MacFuse/MacFusion for Mac, or ExpanDrive for Windows or Mac ($$, but free trial). Then, you can use any text editor to open files and copy them to your computer with the usual drag-drop.

    ...

    1. The most important thing to get used to is the convention of using ".", "_", or capitalizing the first letter of each word in names rather than using spaces, and limiting your use of any other punctuation. Spaces are great for mac and windows folder names when you are using visual interfaces, but on the command line, a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: 1 named Bio, another named I, and a third named Team. Now this is certainly behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is correctly telling it to do what you want it to do".
    2. Name things something that makes it obvious to you what the contents are, not just today but next week, next month, and next year, even if you don't touch it for weeks, months, or years.
    3. Prefixing file/folder names with international date format (YYYY-MM-DD) will ensure that listing the contents will print in an order in which they were created. This can be useful when doing the same or similar analysis on new samples as new data is generated.
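
The naming advice above can be checked in a throwaway directory. This sketch assumes nothing beyond standard commands; the file names ending in _samples.txt are invented for the illustration.

```shell
names_dir=$(mktemp -d)
cd "$names_dir"

mkdir Bio I Team                          # unquoted spaces: three separate folders
folder_count=$(ls | wc -l | tr -d ' ')
echo "folders created: $folder_count"

mkdir Bio_I_Team BioITeam                 # underscores or capitalization keep each name whole

# Date prefixes in YYYY-MM-DD form sort chronologically in a plain ls.
touch 2022-06-01_samples.txt 2021-12-31_samples.txt 2022-05-15_samples.txt
date_sorted=$(ls *_samples.txt)
echo "$date_sorted"
```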

    ...

    Stampede2 is a computer cluster connected to three file servers (each with unique characteristics), and other computer infrastructure. For the purpose of this class, and your own work, you only need to understand the basics of the 3 file servers to know how to use them effectively. The 3 servers are named "HOME", "WORK", and "SCRATCH", and we will work with them all over the next 5 days.


                          $HOME                            $WORK                                                     $SCRATCH
    Purged?               No                               No                                                        Files can be purged if not accessed for 10 days.
    Backed Up?            Yes                              No                                                        No
    Capacity              10GB                             1TB                                                       Basically infinite.
    Commands to Access    cdh, cd $HOME/                   cdw, cd $WORK/                                            cds, cd $SCRATCH/
    Purpose               Store Executables                Store Files and Programs                                  Run Jobs
    Time spent            When modifying basic settings    When installing new programs; Storing raw or final data   When analyzing data

    ...

    Code Block
    languagebash
    titleExample command for copying data from a $WORK directory to $SCRATCH. This command is only an example of something you may use in the future. As you do not have any fastq files on $WORK, or at least likely do not have them in a folder titled 'my_fastq_data', if you tried this command you would be expected to get a message stating no such file or directory found.
    cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/
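
If you would like to try the same copy pattern without real data, the sketch below rebuilds it inside a temporary sandbox. FAKE_WORK and FAKE_SCRATCH are hypothetical stand-ins for $WORK and $SCRATCH, and the empty sample files exist only for the demo.

```shell
sandbox=$(mktemp -d)
FAKE_WORK="$sandbox/work"                 # hypothetical stand-in for $WORK
FAKE_SCRATCH="$sandbox/scratch"           # hypothetical stand-in for $SCRATCH

# Stage two empty fastq files plus one unrelated file.
mkdir -p "$FAKE_WORK/my_fastq_data"
touch "$FAKE_WORK/my_fastq_data/sample1.fastq" \
      "$FAKE_WORK/my_fastq_data/sample2.fastq" \
      "$FAKE_WORK/my_fastq_data/notes.txt"

# The destination folder has to exist before cp can copy several files into it.
mkdir -p "$FAKE_SCRATCH/my_project"
cp "$FAKE_WORK"/my_fastq_data/*fastq "$FAKE_SCRATCH/my_project/"

copied=$(ls "$FAKE_SCRATCH/my_project" | wc -l | tr -d ' ')
echo "files copied: $copied"
```

Only names matching *fastq are copied, so notes.txt stays behind on the stand-in for $WORK.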
    

    ...

    Tip
    titleUsing the version numbers for module commands

    While not always strictly necessary, using the version number (in this case "/2.3.2") is a very good habit to get into as it controls what version is to be loaded. In this case, because there are 2 very different versions available (2.3.2 and 1.2.1.1), module load bowtie will actually throw an error which tells you to use the module spider command to figure out how to correctly load the module.

    While it is tempting to only use "module load name" without the version numbers, using the version numbers can help keep track of what versions were used for referencing in your future publications, and make it easier to identify what went wrong when scripts that have been working for months or years suddenly stop working (i.e. TACC changed the default version of a program you are using).

    This is one of the big advantages of using the conda system we will describe shortly: it easily keeps track of all versions of all programs you use.


    Since the module load command doesn't give any output, it is often useful to check what modules you have loaded with either of the following commands:

    ...

    Code Block
    languagebash
    titleUsing the mkdir command to create a folder named 'src' inside of your $WORK directory
    collapsetrue
    cd $WORK
    mkdir src
    cd src


    Code Block
    languagebash
    titleUse the wget command to download the linux installer directly to your current directory
    collapsetrue
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

    You should see a download bar showing you the file has begun downloading. When complete, the ls command will show you a new compressed file named 'Miniconda3-latest-Linux-x86_64.sh'.

    Using scp.

    This is not necessary if you followed the wget commands above. In a new browser or tab you would navigate to https://docs.conda.io/en/latest/miniconda.html but instead of right clicking on the "Miniconda3 Linux 64-bit" in the linux installers section and choosing copy link address you would simply left click and allow the file to download directly to your browser's Downloads folder. Using information from the SCP tutorial you would then transfer the local 'Miniconda3-latest-Linux-x86_64.sh' file to the stampede2 remote location '$WORK/src'. Note that you are downloading a file that will work on TACC, but not on your own computer. Don't get confused thinking you need windows or mac versions.

    Given that the wget command doesn't involve having to use MFA, or the somewhat cumbersome use of 2 different windows, and is subject to many fewer typos, hopefully you see how wget is preferable provided left clicking on a link directly downloads a file.

    ...

    Code Block
    languagebash
    titleThe following command is then used to install miniconda
    bash Miniconda3-latest-Linux-x86_64.sh
    logout
    #log back in using the ssh command. 
    conda config --set auto_activate_base false

    ...

    For help with the ssh command please refer back to Windows10 or MacOS tutorials. If you log out and back in 1 more time, what do you notice is different?

    ...

    Code Block
    languagebash
    titleusing the conda create command, make a new environment named "GVA-fastqc", and activate it
    conda create --name GVA-fastqc
    # enter 'y' to proceed
    conda activate GVA-fastqc

    This will once again change your prompt. This time the expected prompt is:

    No Format
    (GVA-fastqc) tacc:~$

    Again, if you see something different, you need to get the instructor's attention. For the rest of the course, it is assumed that your prompt will start with (GVA-program_name). If not, remember that you need to use the conda activate GVA-program_name command to enter the environment.
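
Besides reading the prompt, you can check which environment is active from inside a script. The sketch below relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated; the variable names expected_env and env_status are made up for this example, and on a machine without conda the check simply reports the environment as inactive.

```shell
expected_env="GVA-fastqc"

# CONDA_DEFAULT_ENV holds the name of the active conda environment, if any.
if [ "${CONDA_DEFAULT_ENV:-}" = "$expected_env" ]; then
    env_status="active"
else
    env_status="inactive"
    echo "Not in $expected_env yet. Run: conda activate $expected_env"
fi
echo "environment status: $env_status"
```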

    ...

    The anaconda and miniconda interfaces to the conda system are becoming increasingly popular for controlling one's environment, streamlining new program installation, and tracking what versions of programs are being used. A comparison of the two different interfaces can be found here. The deciding factor on which interface we will use is hinted at, but not explicitly stated in the referenced comparison: TACC does not have a GUI and therefore anaconda will not work, which is why we installed miniconda above.

    Similar to the module system that TACC uses, the "conda" system allows for simple commands to download required programs/packages, and modify environmental variables (like $PATH discussed above). Two huge advantages of conda over the module system are: #1 instead of relying on the employees at TACC to take a program and package it for use in the module system, anyone (including the same authors publishing a new tool they want the community to use) can create a conda package for a program; #2 rather than being restricted to use on the TACC clusters, conda works on all platforms (including windows and macOS), and deals with all the required dependency programs in the background for you.

    ...

    Code Block
    languagebash
    titleattempt to install the fastqc program using conda
    conda activate GVA-fastqc
    
    conda install fastqc
    

    If you have already activated your GVA-fastqc environment, the first line will not do anything, but if you have not, you will see your prompt has changed to now say (GVA-fastqc) on the far left of the line. As to the second command, like we saw with the module system above, things aren't quite this simple. In this particular case, we get a very helpful error message that can guide our next steps:

    ...

    More information about "channels" can be found here. By the end of this course you may find that the 'bioconda' channel is full of lots of programs you want to use, and may choose to permanently add it to your list of channels so the above command conda install fastqc and others used in this course would work without having to go through the intermediate of searching for the specific installation commands, or finding what channel the program you want is in. Information about how to do this, as well as more detailed information of why it is bad practice to go around adding large numbers of channels can be found here.
    Similarly, when we get to the read mapping tutorial, we will go over the conda-forge channel which is also very helpful to have.

    For now, use the error message you saw above to try to install the fastqc program yourself.

    ...

    No Format
    The following packages will be downloaded:      
    
    
        package                    |            build
        ---------------------------|-----------------
        dbus-1.13.18               |       hb2f20db_0         504 KB
        fastqc-0.11.9              |       hdfd78af_1         9.7 MB  bioconda
        font-ttf-dejavu-sans-mono-2.37|       hd3eb1b0_0         335 KB
        glib-2.69.1                |       h4ff587b_1         1.7 MB
        libxml2-2.9.14             |       h74e7548_0         718 KB
        openjdk-11.0.13            |       h87a67e3_0       341.0 MB
        ------------------------------------------------------------
                                               Total:       354.0 MB  
    
    
    The following NEW packages will be INSTALLED:    
      _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
      _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
      dbus               pkgs/main/linux-64::dbus-1.13.18-hb2f20db_0
      expat              pkgs/main/linux-64::expat-2.4.4-h295c915_0
      fastqc             bioconda/noarch::fastqc-0.11.9-hdfd78af_1
      font-ttf-dejavu-s~ pkgs/main/noarch::font-ttf-dejavu-sans-mono-2.37-hd3eb1b0_0
      fontconfig         pkgs/main/linux-64::fontconfig-2.13.1-h6c09931_0
      freetype           pkgs/main/linux-64::freetype-2.11.0-h70c0345_0
      glib               pkgs/main/linux-64::glib-2.69.1-h4ff587b_1
      icu                pkgs/main/linux-64::icu-58.2-he6710b0_3
      libffi             pkgs/main/linux-64::libffi-3.3-he6710b0_2
      libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
      libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
      libpng             pkgs/main/linux-64::libpng-1.6.37-hbc83047_0
      libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
      libuuid            pkgs/main/linux-64::libuuid-1.0.3-h7f8727e_2
      libxcb             pkgs/main/linux-64::libxcb-1.15-h7f8727e_0
      libxml2            pkgs/main/linux-64::libxml2-2.9.14-h74e7548_0
      openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0
      pcre               pkgs/main/linux-64::pcre-8.45-h295c915_0
      perl               pkgs/main/linux-64::perl-5.26.2-h14c3975_0
      xz                 pkgs/main/linux-64::xz-5.2.5-h7f8727e_1
      zlib               pkgs/main/linux-64::zlib-1.2.12-h7f8727e_2
       
    
    Proceed ([y]/n)? y
    
    
    Downloading and Extracting Packages
    fastqc-0.11.9        | 9.7 MB    | ####################################################################################################################################################################################### | 100% 
    font-ttf-dejavu-sans | 335 KB    | ####################################################################################################################################################################################### | 100% 
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done

    ...

    This is about using the git clone command. Git is a command often used for collaborative program development or sharing of files. Some groups also put the programs or scripts associated with a particular paper on a github project and publish the link in their paper or on their lab website. Github repositories are a great thing to add to a single location in your $WORK directory.

    Here we will clone the github repository for the E. coli Long-Term Evolution Experiment (LTEE) originally started by Dr. Richard Lenski. These files will be used in some of the later tutorials, and are a good source of data for identifying variants in NGS data as the variants are well documented, and emerge in a controlled manner over the course of the evolution experiment. Initially cloning a github repository is exceptionally similar to using the wget command to download the repository: it involves typing 'git clone' followed by a web address where the repository is stored. As we did for installing miniconda with wget, we'll clone the repository into a 'src' directory inside of $WORK.

    Code Block
    languagebash
    titleUsing the mkdir command to create a folder named 'src' inside of your $WORK directory
    collapsetrue
    cd $WORK
    mkdir src
    cd src

    If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created. 
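
The benign error mentioned above is easy to reproduce in a sandbox. This sketch uses a temporary directory as a hypothetical stand-in for $WORK, and also shows the -p option of mkdir, which skips the error when the folder is already there.

```shell
src_parent=$(mktemp -d)               # hypothetical stand-in for $WORK
cd "$src_parent"

mkdir src                             # first attempt succeeds

# A second plain mkdir fails because src already exists.
if mkdir src 2>/dev/null; then
    second_try="created"
else
    second_try="already exists"
fi
echo "second mkdir: $second_try"

# With -p, mkdir quietly accepts a directory that is already present.
mkdir -p src
p_status=$?
echo "mkdir -p exit status: $p_status"
```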

    ...

    In previous years, the pip installation program was used to install a few programs. While those programs will be installed through conda this year, the link here is provided to give a detailed walk through of how to use pip on TACC resources. This is particularly helpful for making use of the '--user' flag during the installation process as you do not have the expected permissions to install things in the default directories.

    This concludes the linux and stampede2 refresher/introduction tutorial.

    Genome Variant Analysis Course 2022 home.