A healthy taste of resources available, specifically for this course - not a comprehensive catalog.

Linux/TACC

Linux fundamentals on this wiki
Wikis for the 3 CBRS Unix/Linux workshops:
Online tutorials:
- Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
- Unix and Perl bootcamp for biologists: http://korflab.ucdavis.edu/bootcamp.html
- Unix primer (longer version) for biologists: http://korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v2.3.4.pdf

Community Resources

UCSC Genome Browser - visualize and download NGS data (see more below)
Broad Institute Integrative Genomics Viewer (IGV)
- especially good for visualizing BAM file details
Michigan State University ANGUS resources
- A list of their tutorials: http://ged.msu.edu/angus/
- 2012 Next-Gen Sequence Analysis Workshop - a similar tutorial to our course
- Introduction to sequence analysis in the Amazon EC2 cloud, where you can "rent" Linux machines (useful if you don't have access to TACC or BRCF pods)

Galaxy website for online sequencing data analysis
SEQanswers forum - many NGS sequencing questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589

Sequencing Technologies

Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer single molecule sequencing
  - Oxford Nanopore
  - PacBio SMRT system PCR-free protocol
- Single cell sequencing
  - 10x Genomics platforms
- Older technologies (not common now)
  - Life Technologies SOLiD (short reads in "colorspace")
  - Roche/454 – long (multi-Kb) reads often used in assemblies

...

FASTQ analysis/manipulation/QC

Wikipedia FASTQ format page
Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces a nice quality report for FASTQ files
MultiQC – http://multiqc.info/
- A great tool for consolidating multiple QC reports into one HTML page
- Anna's Byte Club tutorial on using MultiQC – https://utexas.atlassian.net/wiki/display/bioiteam/Using+MultiQC
cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Available at TACC at /work/projects/BioITeam/ls5/opt/cutadapt-1.10/cutadapt
  - also needs this $PYTHONPATH modification:
  - export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/site-packages:$PYTHONPATH"
- Script that handles the details of paired-end read trimming
  - /work2/projects/BioITeam/common/script/trim_adapters.sh
trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets. I haven't used it but it seems to be popular.
fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
- Good for hard clipping and FASTA file manipulations. Available at TACC.
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
seqtk – https://github.com/lh3/seqtk
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
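All of the tools in this section read or write FASTQ records, so it can help to see the format spelled out. The sketch below is only an illustration, not how any of the tools above are implemented. It assumes the common 4-line record layout (no wrapped sequence lines) and Phred+33 quality encoding; the names `FastqRecord` and `read_fastq` are made up for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str             # header line without the leading '@'
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a 4-line-per-record FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Phred+33 encoding (assumed): score = ASCII code - 33
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# Small in-memory example with a made-up read
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.qualities)
```

For real data, prefer the dedicated tools listed above; they handle compressed input, paired files and malformed records far more robustly.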

Reference genomes

Gencode – https://www.gencodegenes.org/
- reference genomes, transcriptomes and high-quality annotations for human and mouse
- https://www.gencodegenes.org/releases/current.html
UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
Ensembl downloads – http://ftp.ensembl.org/pub/
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
NCBI Nucleotide collection
- RefSeq – https://www.ncbi.nlm.nih.gov/refseq/
  - well curated genome, transcriptome sequences
- GenBank – https://www.ncbi.nlm.nih.gov/genbank/
  - public repository for sequence data, especially for prokaryotic genomes
  - not curated
Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
- excellent introduction to the types of genome references and the vocabulary used to describe them
  - aimed at higher eukaryotes but vocabulary useful nonetheless
GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
Support for mapping to ALT contigs containing variants
- bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md

Basic alignment and aligners

Comparison of different aligners
- by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
File formats
- input: FASTQ format
- output: the SAM (Sequence Alignment Map) format specification (SAM1.pdf)
  - SAM1.pdf – header fields, body fields, flag definitions
  - https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf – tag fields
Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
  - fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  - fast, sensitive and extremely configurable
The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
  - /work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
  - /work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
  - /work/projects/BioITeam/common/script/merge_sorted_bams.sh
- kallisto pseudo-alignment to annotated transcripts
  - /work/projects/BioITeam/common/script/run_kallisto.sh
- also available on many BRCF pods under /mnt/bioi/script.
  - many pre-built references also available in /mnt/bioi/ref_genome
- email or come talk to Anna if you have questions or problems

Transcriptome-aware aligners

HISAT2 – https://daehwankimlab.github.io/hisat2/
- new and fast, with support for alignment to a single genome or to a "population" of genomes
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNA-seq aligner
- releases: https://github.com/alexdobin/STAR/releases
- paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/
- STAR manual (pdf): https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
kallisto - https://pachterlab.github.io/kallisto/about
- ultra-fast RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances

...

SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
  - type in a decimal number to see which flags are set
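What the flag calculator does can be sketched in a few lines. The bit values below follow the FLAG field definitions in the SAM specification; the function name `decode_sam_flag` and the wording of the descriptions are assumptions made for this example, not Picard's own code or labels.

```python
from typing import List

# (bit value, description) pairs for the SAM FLAG field
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_sam_flag(flag: int) -> List[str]:
    """Return a description for every bit that is set in a decimal SAM flag."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_sam_flag(99))
```

The same bit values are what samtools flag filtering options operate on, so decoding a few flags by hand is a useful sanity check before filtering.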
samtools – by Heng Li
- SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
- older 0.1.xx versions: http://samtools.sourceforge.net/
- newer 1.3+ versions: http://www.htslib.org/
Picard toolkit – http://broadinstitute.github.io/picard/
- SAM/BAM utilities that are read-group aware
- especially MarkDuplicates and MarkDuplicatesWithMateCigar for flagging duplicate alignments
  - http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
  - http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar
SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for SAM/BAM files
bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common BED, BAM, VCF, GFF file manipulation such as:
- intersecting BAM or BED with annotation files
  - bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)
- merging overlapping regions
  - bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)
- format conversion
  - bedtools bamtobed (http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html)
  - bedtools bamtofastq (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html)
  - bedtools bedtobam (http://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html)
- See BEDTools Overview for some common use cases.
- Available in the TACC module system
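To make "merging overlapping regions" concrete, here is a minimal sketch of the idea behind bedtools merge. It is a conceptual illustration only, not the bedtools implementation: it assumes BED-style 0-based, half-open intervals held in memory as (chrom, start, end) tuples, and the name `merge_intervals` is hypothetical. Like the default bedtools merge behavior, it also joins "book-ended" intervals that touch without overlapping.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals on the same chromosome."""
    merged: List[Interval] = []
    # Sorting by chromosome, then start, puts mergeable intervals side by side
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one
            previous = merged[-1]
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 300, 400),
    ("chr2", 10, 20),
    ("chr1", 500, 600),
]
print(merge_intervals(regions))
```

bedtools itself expects input that is already sorted by chromosome and start position; the sketch sorts internally only to stay self-contained.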
RNA-seq QC, metrics & plotting tools:
- RSeQC – http://rseqc.sourceforge.net/
- RNA-SeQC (Broad Institute) –
  - https://software.broadinstitute.org/cancer/cga/rna-seqc
  - https://software.broadinstitute.org/cancer/cga/rnaseqc_run
- RNA-QC-Chain – http://bioinfo.single-cell.cn/rna-qc-chain.html

File formats and conversion

SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
HTS format specifications – http://samtools.github.io/hts-specs/
- clearinghouse page for a number of NGS formats (SAM, CRAM, VCF, BCF, etc.)
Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
  - NCBI documentation
  - SRA toolkit downloads
BioITeam script for converting GTF/GFF3 files to BED format
- /work/projects/BioITeam/common/script/gtf_to_bed.pl
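The main pitfall in any GTF-to-BED conversion is the coordinate system: GTF/GFF3 positions are 1-based and inclusive, while BED positions are 0-based and half-open. The sketch below shows only that core step for a single line. It is not the BioITeam script and makes no claim about how that script behaves; the function name `gtf_line_to_bed`, the choice of `gene_id` as the BED name, and the sample line are all assumptions for illustration.

```python
import re


def gtf_line_to_bed(line: str) -> str:
    """Convert one 9-column GTF line to a 6-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    chrom, _source, _feature, start, end, score, strand, _frame, attributes = fields

    # Use the gene_id attribute as the BED name when present (assumed choice)
    match = re.search(r'gene_id "([^"]+)"', attributes)
    name = match.group(1) if match else "."

    # GTF uses '.' for a missing score; write 0 in that case
    bed_score = "0" if score == "." else score

    # 1-based inclusive start becomes 0-based; the end coordinate is unchanged
    bed_start = str(int(start) - 1)
    return "\t".join([chrom, bed_start, end, name, bed_score, strand])


# Made-up annotation line
gtf = 'chr1\texample\tgene\t101\t200\t.\t+\t.\tgene_id "geneA"; gene_name "A";'
print(gtf_line_to_bed(gtf))
```

Forgetting the start-coordinate shift is a common source of off-by-one errors when intersecting annotations with alignments.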
UCSC file format conversion scripts - useful for converting WIG and BED files to and from their corresponding binary formats.
- Make sure you download the correct scripts for your operating system!
- Directories containing these tools can be found at TACC:
  - /work/projects/BioITeam/common/opt/UCSC_utils.2013_03
  - /work/projects/BioITeam/common/opt/UCSC_utils.2017_07
Mason program for simulating NGS sequencing reads

...

- Also available as a BioContainers module

UCSC Genome Browser

Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
Visualize mapped data at UCSC genome browser on this wiki

RNAseq/Transcriptome analysis

General RNA-seq Differential Gene Expression (DGE) analysis workflow from R's Bioconductor:
- https://www.bioconductor.org/help/workflows/rnaseqGene/
Gene quantification from BAM/BED file reads
- featureCounts (part of the Subread package) – http://subread.sourceforge.net/
- HTSeq – https://htseq.readthedocs.io/en/master/
HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
- transcriptome-aware alignment & quantification from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
DESeq2 – R Bioconductor package for DGE
- DESeq (version 1) documentation:
  - https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
  - while DESeq2 is more sophisticated, reading the original documentation is a better introduction to concepts
- DESeq2 documentation:
  - http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
kallisto – https://pachterlab.github.io/kallisto/
- RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
  - blindingly fast, but aligns only to a transcriptome
- companion quantification tool is sleuth – http://pachterlab.github.io/sleuth/about
- overview presentation – 2015-10-21-Kallisto.Anna.pdf
The Tuxedo pipeline: RNAseq with tophat/cufflinks
- one of the first tool suites for transcriptome-aware RNA-seq alignment and quantification
  - rarely used now, as other tools are much faster & more accurate
- RNAseq analysis protocol article in Nature Protocols
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
  - exon-aware sequence alignment (uses bowtie2/bowtie)
  - resource bundles for selected organisms (GFF annotations, pre-built bowtie2 references, etc.)
- cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
  - transcript quantification, normalization, differential expression
Dhivya Arasappan's Introduction to RNA Seq CBRS 2021 summer school course

Variant calling

Broad Institute GATK (Genome Analysis Tool Kit) – https://software.broadinstitute.org/gatk/documentation/
- complex but powerful
- used by TCGA (The Cancer Genome Atlas), 1000 Genomes
File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
Dan Deatherage's Genome Variant Analysis CBRS 2021 summer school course

Genome Annotation

GO – http://geneontology.org/
- The Gene Ontology resource, a large source of information on the functions of genes
GOrilla – http://cbl-gorilla.cs.technion.ac.il/
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool
GSEA – https://www.gsea-msigdb.org
- Gene Set Enrichment Analysis
DAVID – https://david.ncifcrf.gov/
- Functional annotation from user-supplied gene lists
GREAT – http://bejerano.stanford.edu/great/public/html/
- Genomic Regions Enrichment of Annotations Tool
- Takes BED files as input and outputs enriched genes, GO-terms, motifs, etc.
  - human, mouse, zebrafish
MEME-suite – http://meme-suite.org/
- A motif identification and discovery tool. Works with most species.
- Takes FASTA files as input
  - filter your BAM/BED files to get the regions of interest
  - then convert to FASTA using bedtools getfasta.

...
