BacSeq
Introduction
The purpose of this guide is to aid current and future Whiteley Lab members and University of Texas microbiologists with bacterial RNA-Seq analysis. Once you have analyzed your data with this pipeline, you will have files identifying differentially expressed genes and files that can be used to identify novel non-coding RNAs, transcriptional start sites, and operons. Throughout this guide I will provide hyperlinks to valuable resources and programs that will help get your analysis off the ground and running.
The pipeline is Unix-based, and it runs on the Lonestar supercomputer cluster at TACC. Unix can be intimidating; however, if you follow this tutorial (Unix tutorial) starting on page 15 you will learn all of the basic commands necessary to carry out the analyses outlined here. Earlier sections of the tutorial tell you what you need to do to run Unix on a Mac or PC. Throughout this guide, text typed into the command line will appear in dashed boxes, while comments describing what you are typing will be preceded by a “#” and precede the commands inside the boxes:
# This is a comment
This is what you type into the command line
This pipeline is set up to analyze RNA-Seq libraries prepared with the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB #E7300S) and is currently limited to Pseudomonas aeruginosa PAO1, P. aeruginosa PA14, Aggregatibacter actinomycetemcomitans D7S-1, Streptococcus gordonii Challis CH1, and Escherichia coli K12 W3110. The pipeline can be easily modified in the future to accommodate new library preparation methods and new bacterial strains. Parts of the pipeline that can be updated will be annotated in this guide.
Methods Overview
A basic RNA-Seq experiment will have two experimental conditions (i.e. treated and untreated, wild type and mutant, control and test, etc.), with two biological replicates for each experimental condition. Obviously, increasing the number of replicates will increase your statistical significance. To begin analysis, you will need your fastq sequencing read files for all of your conditions and replicates. The files are processed by a series of scripts that I have written. These scripts produce files containing the mapped sequencing reads, differentially expressed gene tables, and several graphs from the differential gene expression analyses.
- mapRNASeq.sh - This script reads in a fastq file, producing a sorted bam file and indexed bam file that can be read into the Integrative Genomics Viewer and a counts file that contains the number of reads mapping to each gene in your genome.
- calcRNASeq.sh - This script reads in your individual counts files, joins them into a table, produces another table with the differential expression of all of your genes, and several graphs that summarize these results.
Getting Started
Obtain fourierseq and lonestar accounts
In order to analyze your data you will need accounts on two servers at UT: fourierseq and lonestar. Fourierseq is managed by the University of Texas Genome Sequencing and Analysis Facility (UTGSAF) and is where fastq files are stored after sequencing runs are completed. Lonestar is the supercomputer cluster where you will conduct your analyses, and it is managed by the Texas Advanced Computing Center (TACC).
Follow this link to obtain a fourierseq account: fourierseq account details
Follow this link to obtain a lonestar account: lonestar account details
Following the request for the lonestar account, you will have to obtain a lab allocation or get added to your lab's Lonestar allocation (i.e. WhiteleyLabNGS).
Note that your lonestar account has three main directories associated with your new username: home, work, and scratch. The home directory has very limited storage space (~4 GB), and should be used primarily for installing new software that you may want to try out. The work directory has more space (~250 GB) and is backed up. The scratch directory has unlimited storage, but files that have not been used in 10 days may be deleted without warning. Therefore you will want to use the scratch directory for most of your analyses, and then transfer your completed data analysis files back to your work directory or to your personal computer, so that they are not deleted by TACC.
Examples:
/home1/8675309/jenny
/work/8675309/jenny
/scratch/8675309/jenny
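Copying a finished analysis out of scratch before the purge can claim it takes a single command. A sketch, assuming the TACC-provided $SCRATCH and $WORK environment variables and a hypothetical results directory named Proj_EXAMPLE:

```shell
# Copy a finished analysis directory from scratch back to work so the
# TACC scratch purge cannot delete it. Proj_EXAMPLE is hypothetical.
cp -r $SCRATCH/Proj_EXAMPLE $WORK/
```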
Log into lonestar and set up your profile
Log into the Lonestar server.
# On a Mac: Open your Terminal and type the following. Substitute youraccount with
# whatever you set your username to be (i.e. jenny, pjorth, etc).
ssh youraccount@lonestar.tacc.utexas.edu
You will be prompted for your password. Type it, and hit return.
Follow the steps on this wiki page to make a profile that will give you access to a ton of tools that you will need throughout the analysis pipeline (how to set up a profile). This step is absolutely imperative. If your profile is not set up properly, then the pipeline will not work.
Add the following text to the end of your profile so that you can use the scripts that I wrote. Basically it tells your computer where to find the scripts without having to type in the whole path to their location in my directory. It’s like putting a shortcut on your desktop.
# You can update your profile with the nano text editor. To do so, type in the command
# below, copy and paste the subsequent text to the end of the file, and save it by
# typing CTRL+o followed by return, then CTRL+x to exit the nano text editor.
nano .profile
PATH=$PATH:/home1/02173/pjorth/local/bin
export LD_LIBRARY_PATH=/home1/02173/pjorth/local/lib/
export PYTHONPATH=$PYTHONPATH:$BI/lib/python2.7/site-packages
module load bowtie/2.0.2
After you have updated your profile, use the following command to reset it and allow it to use the new features.
# This will only work in your home directory
source .profile
Create links (shortcuts) to your scratch and work directories for easier access.
# Type in these commands while in your home directory; they will make your work and
# scratch directories easier to access, especially when uploading your files
ln -s $SCRATCH scratch
ln -s $WORK work
Navigating lonestar
There are a couple of commands that will help you quickly change directories.
# Change to your home directory
cdh
# Change to your work directory
cdw
# Change to your scratch directory
cds
The Analysis
Download your sequencing data and upload to lonestar
The fastest way to download your files from fourierseq to lonestar is by using the UNIX program scp. This allows you to copy files from a secure location (i.e. fourierseq) to your current workspace (i.e. lonestar). Here is how I would transfer my files from fourierseq.
After your sequencing is done, you will receive an email from the GSAF telling you where to find your files (AKA the path to your files). It will say something like this:
“Data from your illumina job JA12396 is now available on the fourierseq server at:
/raid/Proj_EXAMPLE”
To get this data onto lonestar you will need to login, change to your scratch directory, and use scp to copy your files.
# login, enter your password when prompted
ssh jenny@lonestar.tacc.utexas.edu
# change to your scratch directory
cds
# OPTION 1: copy files one at a time using scp
scp jenny@fourierseq.icmb.utexas.edu:/raid/Proj_EXAMPLE/file1.fastq ./
# OPTION 2: copy the whole directory containing multiple files
scp -r jenny@fourierseq.icmb.utexas.edu:/raid/Proj_EXAMPLE/ ./
In the examples above you are telling scp where to find the sequencing data by using your fourierseq username and the path to the files on fourierseq.
There are several ways to transfer your fastq files from fourierseq to lonestar for your analysis. An easier (but much slower) alternative is to use a program with a graphical user interface: log into fourierseq, transfer your files to your personal computer, and then upload them to lonestar for analysis. On a Mac you can do this with a program called Cyberduck (Cyberduck download); however, FileZilla (Download FileZilla) also works very well and is available for PC. With either of these programs, set up an SFTP (Secure File Transfer Protocol) connection to your account at lonestar.tacc.utexas.edu, and then you can drag and drop files from your computer to the server, or from the server to your computer, just like you would normally in your computer’s file system. Remember you will need to load your files into your scratch directory on lonestar for your analyses.
mapRNASeq.sh: mapping your sequencing reads
Basic idea:
USAGE: mapRNASeq.sh in_file out_pfx assembly threads(x)
in_file = name of the fastq input file
out_pfx = the desired prefix for all of your output files
assembly = the genome assembly that the reads should be mapped to
threads(x) = the number of processing threads to use
# An example:
mapRNASeq.sh my.fastq Library1 AAD7S 4
In the example above my.fastq will be mapped to the Aa D7S-1 genome using 4 processors, and all of the output files will begin with Library1.
Options for assemblies:
A. actinomycetemcomitans D7S-1 AAD7S
E. coli K12 W3110 ECK12W3110
P. aeruginosa PAO1 PAO1
P. aeruginosa PA14 PA14
S. gordonii Challis CH1 SGCH1
How to run it:
To run the script properly on lonestar you have to run the script from a commands file in the directory containing your fastq files. So create a text file called commands containing the following text. This can be done in the Unix terminal using nano.
# Open nano and save your file as "commands". Files are saved in nano with CTRL+o
# and you can exit nano with CTRL+x. In the example below we have a control
# condition and a test condition with two replicates each. We are mapping to the
# PAO1 genome and using 12 processing threads (on lonestar always use 12).
nano
mapRNASeq.sh Control1.fastq Control1 PAO1 12
mapRNASeq.sh Control2.fastq Control2 PAO1 12
mapRNASeq.sh Test1.fastq Test1 PAO1 12
mapRNASeq.sh Test2.fastq Test2 PAO1 12
Create a launcher to run your commands file.
# The launcher command will adapt your commands to a format that lonestar can
# interpret. If you put in your email address lonestar will email you when your job
# begins and ends. You also have to designate the time you anticipate the job will
# take following "-t" as hh:mm:ss; usually 1 hour is more than enough, though the
# example below requests 6 hours to be safe. The "-a" refers to the allocation that
# should be charged for the run. This will always be WhiteleyLabNGS unless you come
# from a different lab. Provide the name of your commands file following the "-j",
# and finally the name you want to call your job after the "-n".
launcher_creator.py -e yourname@email.com -q normal -t 06:00:00 -a WhiteleyLabNGS -j commands -n jobname
The launcher creator will make a file called launcher.sge; this is the file that will do the work. You have to edit one line in the launcher.sge file to make sure that each of your mapping tasks runs as fast as possible. This is the line that tells lonestar how many processors to assign your job. It should read “#$ -pe 12way 12”, and you will want to change it to “#$ -pe 1way x”, where x equals 12 times the number of lines in your commands file. If you have 1 line in your commands file it should be “1way 12”, 4 commands would be “1way 48”, 12 commands would be “1way 144”, etc.
# Update the launcher.sge file with the nano text editor and change the line below:
nano launcher.sge
old text:
#$ -pe 12way 12
new text for 1 line commands file:
#$ -pe 1way 12
new text for 3 line commands file:
#$ -pe 1way 36
etc.
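If you would rather not do the multiplication by hand, the slot count can be computed from the commands file itself. This is a hedged sketch, not part of the pipeline; it assumes the 12-cores-per-node rule described above and a commands file in the current directory:

```shell
# Compute the "1way" slot count: 12 cores per node times the number of
# task lines in the commands file.
NUM_TASKS=$(wc -l < commands)
SLOTS=$((12 * NUM_TASKS))
echo "#\$ -pe 1way $SLOTS"
```

For a four-line commands file this prints `#$ -pe 1way 48`, the value you would paste into launcher.sge.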
Now you can run your launcher file.
qsub launcher.sge
Now your analysis for all four files is running simultaneously on four separate computing nodes with 12 processors on each node. You can check the status of your analysis from the terminal, but you will also receive an email when your job is complete.
qstat
If you realize you have made a mistake and need to cancel your job, you can do so with the “qdel” command, using the job number sent to you via email (and also shown when typing qstat).
# Use your job number; "123456" is only an example
qdel 123456
What do I get out of this?
Once your job is completed you will receive an email, and you should find the following files in the directory you launched the job from (your scratch directory). The log files for each fastq file that you analyze have important statistics from Flexbar about reads removed by trimming, and there are also important mapping statistics from Bowtie2.
Control1.trim.fastq
Control1.sam
Control1.sorted.bam
Control1.sorted.bam.bai
Control1.count.txt
Control1.log.txt
Control2.trim.fastq
Control2.sam
Control2.sorted.bam
Control2.sorted.bam.bai
Control2.count.txt
Control2.log.txt
Test1.trim.fastq
Test1.sam
Test1.sorted.bam
Test1.sorted.bam.bai
Test1.count.txt
Test1.log.txt
Test2.trim.fastq
Test2.sam
Test2.sorted.bam
Test2.sorted.bam.bai
Test2.count.txt
Test2.log.txt
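A quick way to pull the key statistics out of the log files without opening each one is to grep for them. This is a sketch under the assumption that Bowtie2's summary (which includes an "overall alignment rate" line) was captured in the .log.txt files:

```shell
# Print the Bowtie2 overall alignment rate from every log file.
grep "overall alignment rate" *.log.txt
```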
calcRNASeq.sh: joining your count files and calculating differential expression
Basic idea:
USAGE: calcRNASeq.sh -o OUT_PFX -c CONTROL_PFX -x [#] -t TEST_PFX -y [#] <file1> <file2> <file3> … <filen>
-o OUT_PFX substitute “OUT_PFX” with the prefix you want for all of your output file
-c CONTROL_PFX substitute “CONTROL_PFX” with the prefix for your control condition
-x [#] substitute “[#]” with the number of control condition replicates
-t TEST_PFX substitute “TEST_PFX” with the prefix for your test condition
-y [#] substitute “[#]” with the number of test condition replicates
<file1>… <filen> list your count files, with control conditions listed first followed by test condition files
How to run it:
This is a little simpler to run than mapRNASeq.sh. You don’t need to create a commands file; you can run it directly from the command line, because it does not require as much computational power. Make sure you use the flags (i.e. “-o”, “-c”, etc).
# An example, continuing from mapRNASeq.sh above
calcRNASeq.sh -o Exp1 -c Control -x 2 -t Test -y 2 Control1.count.txt Control2.count.txt Test1.count.txt Test2.count.txt
This script takes any number of count files and joins them together in one count table, where the first column is the locus tags for the genome and the following columns contain the read counts for each locus for each condition. Then, once the table is created in the proper format, it determines differential expression using the R package DESeq. DESeq takes your joined count file with all of your conditions and replicates, normalizes the total counts for each condition/replicate, and calculates differential gene expression using a test analogous to Fisher’s exact test, based on a negative binomial distribution.
What do I get out of this?
This script creates a log file, and three table files that you will find in your current directory that summarize your results.
Exp1.log.txt
Exp1.joined.counts.txt
Exp1_normCounts.csv
Exp1_DESeq.txt
It also creates several graphs that summarize the fit of the negative binomial distribution to the data, the differential gene expression for each gene plotted against total read counts for each gene, and the p-value distribution for the differential gene expression for all genes.
Exp1_dispEsts.png
Exp1_DEplot.png
Exp1_pvals.png
Post-pipeline analyses
Differential expression analysis
Typically, after you have run your analysis you will want to find out which genes are differentially expressed under your condition of interest. To do this, use Cyberduck to download your file equivalent to Exp1_DESeq.txt. This is a delimited text file that can be opened with Excel. Using Excel you can sort genes that are upregulated or downregulated, filter on p-values, etc.
In each of the genome assembly directories in /home1/02173/pjorth/ref_genome I have saved a file called ASSEMBLY_locus_tag_products.txt (e.g. PAO1_locus_tag_products.txt). This file has an ordered list of all of the locus tags for the genome and the corresponding gene products. The products can be conveniently copied and pasted into Exp1_DESeq.txt using Excel. This way, when you are analyzing your data, you can know the product of each gene that is differentially expressed.
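If you prefer the command line to Excel, the same lookup can be done with the Unix join command. A sketch, assuming locus tags are the first column of both tab-delimited files (verify the column layout of your own Exp1_DESeq.txt before relying on this):

```shell
# Sort both files on the locus tag column, then join products onto the
# DESeq table. -a 1 keeps DESeq rows even if a locus tag has no product.
sort -k1,1 Exp1_DESeq.txt > DESeq.sorted.txt
sort -k1,1 PAO1_locus_tag_products.txt > products.sorted.txt
join -t "$(printf '\t')" -a 1 DESeq.sorted.txt products.sorted.txt > Exp1_DESeq.products.txt
```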
You can also look at your different graphs to get an idea as to how well your analyses worked. These are example graphs taken from the DESeq user manual.
Figure 1. Equivalent to Exp1_dispEsts.png. The black dots are individual gene counts plotted against dispersion, and the red line is the variance estimate.
Figure 2. Equivalent to Exp1_DEplot.png. The grey dots are genes that are not significantly differentially expressed, while red dots indicate differentially expressed genes.
Figure 3. Equivalent to Exp1_pvals.png. The histogram shows the number of genes falling within different p-value ranges from 0.0 to 1.0.
The first graph shows how well your variance estimate models your data, the second summarizes the differential expression data, and the third graph is a histogram of the distribution of p-values for the differential expression of all of the genes. The graphs are most informative for general trends in the data.
Identifying novel non-coding RNAs
Finally, if you are interested in looking for novel non-coding RNAs that may not be annotated in your genome, you can download your .sorted.bam and .sorted.bam.bai files to view with the Integrative Genomics Viewer (IGV webpage). You will need the gff annotation and fasta sequence files for your genome that you mapped your reads to. Once these are loaded and your annotated reference genome is visible, you can open your bam file in IGV (make sure that the .sorted.bam.bai file is in the same directory, otherwise it will not work). You can do this for all of your bam files, and each condition will load as a separate “track” in IGV.
How it works
mapRNASeq.sh: the details
This script is a Unix shell script that processes a fastq file with three different programs. First it uses Flexbar (Flexbar webpage) to remove adapter sequence contamination from the read files. The trimmed fastq file is then mapped to the reference genome using bowtie2 (bowtie2 webpage). The file containing the mapped read information is subsequently processed with 1) SAMtools (SAMtools webpage) to prepare the reads to be viewed against the genome and 2) HT-Seq (HT-Seq webpage) to count the number of reads mapping to each gene in the genome. The following list gives the details for each step in the script.
- Reads in the sequencing read file in .fastq format
- Trims the reads with Flexbar. This currently trims the Illumina small RNA adapter sequences that are part of the NEBNext small RNA library prep kit. So if the libraries are made with a different kit or different adapter, the sequences will need to be changed in:
/home1/02173/pjorth/adapters/3_adapter_seq.fasta
>index_sp
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>3_adapter_seq
TCGTATGCCGTCTTCTGCTTG
# The parameters for Flexbar in the script are as follows.
# -f fastq-i1.8: designates that the input file is an Illumina fastq file
# -n $THREADS: designates that Flexbar runs with the user-defined number of processors
# -ao 8: designates that at least 8 bp of the adapter is required for trimming
# -m 15: designates that the minimum length of the trimmed read is 15 bp
# -ae RIGHT: designates that the adapters should be removed from the right end of the read
# -a $WORK_COMMON/adapters/3_adapter_seq.fasta: designates the adapter file
# -s $IN_FQ: designates the input file
# -t $OUT_PFX.trim: designates the output file
flexbar -f fastq-i1.8 -n $THREADS -ao 8 -m 15 -ae RIGHT -a $WORK_COMMON/adapters/3_adapter_seq.fasta -s $IN_FQ -t $OUT_PFX.trim
OUTPUT= out_pfx.trim.fastq
After removing adapters, the trimmed reads are mapped to the genome with bowtie2 to produce a .sam file.
# The parameters for bowtie2 in the script are as follows.
# -x $REFBASE/$ASSEMBLY/$ASSEMBLY: the bowtie2 index for the reference genome
# -q: the input file is a fastq file
# -p $THREADS: designates that bowtie2 runs with the user-defined number of processors
# -k 1: only report the first position mapped for each read
# -U $OUT_PFX.trim.fastq: designates that the input is a single unpaired read file
# -S $OUT_PFX.sam: designates the output sam file
bowtie2 -x $REFBASE/$ASSEMBLY/$ASSEMBLY -q -p $THREADS -k 1 -U $OUT_PFX.trim.fastq -S $OUT_PFX.sam
OUTPUT= out_pfx.sam
The .sam file is used to count reads mapping to genes with htseq-count, a python program.
# Parameters for htseq-count
# -m intersection-nonempty: count reads if they overlap with the beginning or end of a gene
# -t $FEATURE: counts either gene or CDS features, depending on the genome
# -i locus_tag: in the output file the first column prints the locus tag
htseq-count -m intersection-nonempty -t $FEATURE -i locus_tag $OUT_PFX.sam $REFBASE/$ASSEMBLY/$ASSEMBLY.HTSeq.gff > $OUT_PFX.count.txt
OUTPUT= out_pfx.count.txt
The .sam file is then processed with SAMtools.
# Convert the sam file to a bam file
samtools view -b -S -t $REF_PFX.fai -o $OUT_PFX.bam $OUT_PFX.sam
OUTPUT= out_pfx.bam
# Sort the bam file by the genomic location of each read mapped
samtools sort $OUT_PFX.bam $OUT_PFX.sorted
*OUTPUT= out_pfx.sorted.bam
# Index the sorted bam file for IGV
samtools index $OUT_PFX.sorted.bam
*OUTPUT= out_pfx.sorted.bam.bai
(*These are the two files that need to be in the same folder to view the reads mapping to the genome with IGV)
calcRNASeq.sh: the details
After mapRNASeq.sh produces your .count.txt file, you need to combine that with your other replicates so that you can calculate differential gene expression using DESeq.
There are a couple of tricks with this:
- Each count file has the locus tag for each gene in column 1 and the actual counts in column 2. So to combine two counts files you want to merge column 2 from your 2nd file with columns 1 and 2 from your first. UNIX has the join command, which does exactly this.
- Join works perfectly, but the other problem is that at the end of the .count.txt file, there are ~5 lines that summarize the count information that will screw up DESeq.
- We also need to add a row to the top of the combined counts table that labels each experimental condition in each column. For a typical RNA-Seq experiment you will have two replicates for two experimental conditions.
To solve all of these problems without having to mess around in Excel, I have written a couple of helpful shell scripts to join count files, clean up the unnecessary lines at the end, and add a row to the top that describes the conditions being used.
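The helper scripts do the real work, but the idea behind them can be sketched in a few lines of shell. Illustrative only: this assumes GNU head, two count files whose locus tags appear in the same order, and exactly 5 summary lines at the end of the joined table.

```shell
# Merge two count files on the locus tag (column 1), drop the ~5 HTSeq
# summary lines at the end, and add a header row naming the conditions.
join Control1.count.txt Control2.count.txt > joined.txt
head -n -5 joined.txt > joined.clean.txt
( echo "locus control1 control2"; cat joined.clean.txt ) > joined.final.txt
```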
Next, the shell script reads your joined count file as well as several parameters into an R script that runs DESeq (DESeq webpage). DESeq is a Bioconductor R package for RNA-Seq differential gene expression analysis. There are not many tricks to this script; it essentially goes through all of the steps in the vignette, and produces a number of useful tables and graphs that summarize your data.
How to update the pipeline
Adding new reference genomes
So far I only have this set up on the server to run with P. aeruginosa PAO1, Aa D7S-1, Sg Challis CH1, P. aeruginosa PA14, and E. coli K12 W3110. If other genomes are desired in the future the following directories and files need to be created (email me pjorth at gmail dot com, and I can help you!):
A directory for the genome assembly (e.g. NEWASSEMBLY, PAO1, PA14, etc), which goes in:
/home1/02173/pjorth/ref_genome/NEWASSEMBLY
# Change into the ref_genome directory
cd /home1/02173/pjorth/ref_genome/
# Make the new directory
mkdir NEWASSEMBLY
Example:
/home1/02173/pjorth/ref_genome/PAO1
Within the assembly directory the following files need to be present; you can download the fna and gff files from Genbank (Genbank ftp):
NEWASSEMBLY.fna (the DNA sequence for your genome)
The samtools indexed NEWASSEMBLY.fna.fai file. This is created by running samtools on the fasta genome sequence file in the ref_genome directory:
# Use SAMtools to index your fna sequence file
samtools faidx NEWASSEMBLY.fna
The bowtie2 indexed files for your genome. These are created by running the following from the command line on your .fna file
# Use bowtie2-build to index your assembly for read mapping
bowtie2-build NEWASSEMBLY.fna NEWASSEMBLY
- NEWASSEMBLY.gff (the gff annotation file containing your genes, their location in the genome, and the strand they are on)
- NEWASSEMBLY.HTSeq.gff (the gff annotation file including the features you want to count; I typically remove tRNA and rRNA from this file, leaving the CDS and ncRNAs, and change the ncRNAs to CDS in the gff file. This can easily be done in Excel. Sometimes you may want HTSeq to count gene features instead of CDS; this needs to be reflected in the updated mapRNASeq.sh script.)
- Next the mapRNASeq.sh script needs to be updated to use the new genome. This is done by adding an elif branch for your assembly (the NEWASSEMBLY branch in the example), where NEWASSEMBLY is the name of your NEWASSEMBLY directory in the /ref_genome directory.
# Find the path of the Bowtie reference based on the assembly name provided.
# Assumes a common directory structure. At TACC the structure is rooted at a
# common BioITeam directory. Set WORK_COMMON appropriately outside this script
# if not running at TACC. Here the directory is set to run on the Lonestar
# TACC cluster.
: ${WORK_COMMON:=/home1/02173/pjorth}
REFBASE="$WORK_COMMON/ref_genome"
if [ "$ASSEMBLY" == "PAO1" ]; then
    REF_PFX="$REFBASE/$ASSEMBLY/$ASSEMBLY.fna"; FEATURE="gene";
elif [ "$ASSEMBLY" == "PA14" ]; then
    REF_PFX="$REFBASE/$ASSEMBLY/$ASSEMBLY.fna"; FEATURE="CDS";
elif [ "$ASSEMBLY" == "AAD7S1" ]; then
    REF_PFX="$REFBASE/$ASSEMBLY/$ASSEMBLY.fna"; FEATURE="CDS";
elif [ "$ASSEMBLY" == "SGCH1" ]; then
    REF_PFX="$REFBASE/$ASSEMBLY/$ASSEMBLY.fna"; FEATURE="CDS";
elif [ "$ASSEMBLY" == "NEWASSEMBLY" ]; then
    REF_PFX="$REFBASE/$ASSEMBLY/$ASSEMBLY.fna"; FEATURE="CDS";
else
    REF_PFX="$REFBASE/$ASSEMBLY/${ASSEMBLY}.fna"; FEATURE="CDS";
fi
Note, sometimes you will want to count reads mapping to genes instead of CDS features, and in these cases you will change the text to FEATURE="gene".
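The HTSeq gff edits described earlier (removing tRNA and rRNA features and relabeling ncRNA as CDS) can also be done without Excel. A hedged sketch, not the author's method, assuming a standard 9-column tab-delimited gff with the feature type in column 3:

```shell
# Drop tRNA/rRNA lines and rename ncRNA features to CDS (gff column 3).
awk -F '\t' 'BEGIN{OFS="\t"}
  $3=="tRNA" || $3=="rRNA" {next}
  $3=="ncRNA" {$3="CDS"}
  {print}' NEWASSEMBLY.gff > NEWASSEMBLY.HTSeq.gff
```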
Presentation from UT BYTE Club meeting 20 March 2013
This powerpoint goes over some of the basics of BacSeq, including generally how it works and what you will end up getting from the pipeline.