Core NGS Resources

A sampling of available resources, selected specifically for this course - not a comprehensive catalog.

Linux

Linux fundamentals on this wiki
Online tutorials:
- Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
- Unix and Perl for Biologists: http://korflab.ucdavis.edu/unix_and_Perl/

Community Resources

SEQanswers forum - many NGS questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
UCSC Genome Browser - visualize and download NGS data (see more below)
Galaxy website for online sequencing data analysis
Broad Institute Integrative Genomics Viewer (IGV)
- especially good for visualizing bam file details
Michigan State University ANGUS resources
- A list of their tutorials: http://ged.msu.edu/angus/
- 2012 Next-Gen Sequence Analysis Workshop - a tutorial similar to our course
- Introduction to Sequence analysis in the Amazon EC2 cloud
  - where you can "rent" Linux machines (useful if you don't have access to TACC)

Sequencing Technologies

Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer "single molecule" sequencing
  - Oxford Nanopore
  - PacBio SMRT system
    - PCR-free protocol
- "Single cell" sequencing
  - 10x Genomics platform
- Older technologies (less common now)
  - Life Technologies SOLiD (short reads in "colorspace")
  - Roche/454 – longer (up to ~1 Kb) reads often used in assemblies

Fastq analysis/manipulation

Wikipedia FASTQ format page
Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces nice quality report for fastq files
cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Available at TACC at /work/projects/BioITeam/ls5/opt/cutadapt-1.10/cutadapt
  - also needs this $PYTHONPATH modification:
    export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/site-packages:$PYTHONPATH"
- Script that handles the details of paired-end read trimming
  - /work/projects/BioITeam/common/script/trim_adapters.sh
trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets. I haven't used it but it seems to be popular.
fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Command line tools for fastq analysis and manipulation
- Good for hard clipping. Available at TACC.
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
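
The FASTQ format referenced at the top of this section stores each read as four lines: a header starting with "@", the sequence, a "+" separator, and one quality character per base. The sketch below is only an illustration of that layout, not how FastQC or the fastx toolkit work internally. It assumes unwrapped four-line records and Phred+33 quality encoding (the common encoding for current Illumina data); the function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (assumes no blank lines between records)
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def mean_quality(qual, offset=33):
    """Average Phred score of one read; 0.0 for an empty read."""
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) if scores else 0.0


# Two made-up reads, used only to exercise the functions above.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))

for name, seq, qual in records:
    print(name, seq, phred_scores(qual), mean_quality(qual))
```

For real files, replace the io.StringIO object with an open file handle (or gzip.open(path, "rt") for compressed fastq).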

Reference genomes

Gencode – https://www.gencodegenes.org/
- reference genomes, transcriptomes and high-quality annotations for human and mouse
- https://www.gencodegenes.org/releases/current.html
UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
Ensembl downloads – ftp://ftp.ensembl.org/pub/
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
NCBI GenBank Nucleotide collection – https://www.ncbi.nlm.nih.gov/nuccore/
- for prokaryotic genomes
Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
- excellent introduction to the types of genome references and the vocabulary used to describe them
  - aimed at higher eukaryotes but vocabulary useful nonetheless
GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
Support for mapping to ALT contigs containing variants
- bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md

Basic alignment and aligners

Comparison of different aligners
- by Heng Li, developer of BWA, samtools, and many other tools
File formats
- input: fastq format
- output: the SAM (Sequence Alignment Map) format specification (SAM1.pdf)
Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
  - fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
  - fast, sensitive and extremely configurable
The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
  - /work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
  - /work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
  - /work/projects/BioITeam/common/script/merge_sorted_bams.sh
- email or come talk to me if you have questions or problems

Transcriptome-aware aligners

HISAT2 – https://ccb.jhu.edu/software/hisat2/index.shtml
- new and fast, with support for alignment to a single genome or to a "population" of genomes
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNAseq aligner
TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
kallisto - https://pachterlab.github.io/kallisto/about
- ultra-fast RNA-seq pseudoaligner that goes straight from fastq to estimated transcript abundances

Alignment analysis

SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
  - type in a decimal number to see which flags are set
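
What the flag calculator does can be sketched in a few lines: the SAM FLAG field is a bitwise combination of the flag bits defined in the SAM specification, so decoding means testing each bit. The bit values below follow the specification; the descriptions are paraphrased and the function name is hypothetical.

```python
# Flag bits as defined in the SAM specification (descriptions paraphrased).
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of all bits set in a decimal SAM flag."""
    if not 0 <= flag < 0x1000:
        raise ValueError(f"flag out of range: {flag}")
    return [name for bit, name in SAM_FLAGS if flag & bit]


for value in (99, 147, 4):
    print(value, explain_flag(value))
```

Flags 99 and 147 are the pair typically seen for a properly paired read and its mate.
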
samtools – by Heng Li
- sam/bam conversion, flag filtering, sorting, indexing, duplicate filtering
- 0.1.xx versions: http://samtools.sourceforge.net/
- 1.x+ versions: http://www.htslib.org/
Picard toolkit – http://broadinstitute.github.io/picard/
- sam/bam utilities that are read-group aware
- especially MarkDuplicates and MarkDuplicatesWithMateCigar for flagging duplicate alignments
  - http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
  - http://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesWithMateCigar
SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for sam/bam files.
bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common bed, bam, vcf, gff file manipulation such as:
  - intersecting bam or bed with annotation files
    - bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)
  - merging overlapping regions
    - bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)
  - generation of per-base genome-wide signal in bedGraph format
    - bedtools coverage (http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)
    - bedtools multicov (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html)
  - extracting fasta corresponding to regions
    - bedtools getfasta (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)
  - format conversion
    - bedtools bamtobed (http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html)
    - bedtools bamtofastq (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html)
    - bedtools bedtobam (http://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html)
- Available in the TACC module system
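
To make "merging overlapping regions" concrete, here is a minimal sketch of the idea in plain Python. It is not bedtools' implementation, only an illustration of the sort-then-sweep approach. It assumes BED-style 0-based, half-open intervals given as (chrom, start, end) tuples, and it merges regions that overlap or directly touch; the function name is hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals.

    Intervals are sorted by chromosome name (plain string order) and start,
    then swept once, extending the current region while the next one begins
    at or before its end.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous region: extend it if needed.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


# Made-up regions, used only to exercise the function.
regions = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr1", 500, 600),
    ("chr2", 50, 80),
]
print(merge_intervals(regions))
```

For real data, bedtools merge on a sorted bed file is the tool to use; the sketch only shows what the result means.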

File formats and conversion

SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
  - NCBI documentation
  - SRA toolkit downloads
UCSC file format conversion scripts - useful for converting wig and bed files to and from their corresponding binary formats (bigWig, bigBed).
- Make sure you download the correct script for your operating system!
- Directories containing these tools can be found on ls5 at
  - /work/projects/BioITeam/common/opt/UCSC_utils.2013_03
  - /work/projects/BioITeam/common/opt/UCSC_utils.2017_07
Mason program for simulating NGS sequencing reads

UCSC Genome Browser

Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
Visualize mapped data at UCSC genome browser on this wiki

RNAseq/Transcriptome analysis

HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
- from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
The Tuxedo pipeline: RNAseq with tophat/cufflinks
- RNAseq analysis protocol article in Nature Protocols
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
  - exon-aware sequence alignment (uses bowtie2/bowtie)
  - resource bundles for selected organisms (gff annotations, pre-built bowtie2 references, etc.)
- cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
  - transcript quantification, normalization, differential expression
General RNA-seq analysis workflow from Bioconductor – https://www.bioconductor.org/help/workflows/rnaseqGene/

DESeq2 – R Bioconductor package
- DESeq (version 1) documentation:
- DESeq2 documentation:
kallisto pseudo-alignment – https://pachterlab.github.io/kallisto/
- blindingly fast – but only to transcriptome
- companion quantification tool is sleuth – http://pachterlab.github.io/sleuth/about
- overview presentation – 2015-10-21-Kallisto.Anna.pdf
Dhivya Arasappan's Introduction to RNA Seq CCBB summer school course

Variant calling

Tools
- Broad Institute GATK - complex but powerful; used by TCGA, 1000 Genomes
  - documentation page: https://software.broadinstitute.org/gatk/documentation/
File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
Dan Deatherage's Genome Variant Analysis CCBB summer school course

Genome Annotation

DAVID – functional annotation from user-supplied gene lists
GREAT – analysis tool that takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- for human, mouse, zebrafish
MEME-suite – a motif identification and discovery tool. Works with most species.
- takes fasta files as input, so filter your bam/bed files to get the regions of interest, then extract the corresponding sequences with bedtools getfasta.
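
Going from BED regions to fasta sequences is mostly a matter of getting the coordinates right: BED starts are 0-based and ends are exclusive, which happens to match Python string slicing. The sketch below illustrates that conversion on a toy in-memory "genome"; it is not a replacement for bedtools getfasta (listed under Alignment analysis). The function names and the toy sequence are hypothetical, and strand is ignored.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three tab-separated BED columns."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def extract_fasta(bed_lines, genome):
    """Build fasta text for each BED region.

    genome is a dict mapping chromosome name to sequence string.
    BED coordinates are 0-based, half-open, so genome[chrom][start:end]
    selects exactly the bases in the region.
    """
    records = []
    for line in bed_lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = parse_bed_line(line)
        sequence = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{sequence}\n")
    return "".join(records)


# Toy data, used only to exercise the functions above.
toy_genome = {"chr1": "ACGTACGTAC"}
toy_bed = ["chr1\t2\t6\n", "chr1\t0\t4\n"]
fasta_text = extract_fasta(toy_bed, toy_genome)
print(fasta_text, end="")
```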