2020 Core NGS Resources
A sampling of the resources available, selected specifically for this course; not a comprehensive catalog.
Linux
- 2020 Linux fundamentals on this wiki
Online tutorials:
- Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
- Unix and Perl for Biologists: http://korflab.ucdavis.edu/unix_and_Perl/
Community Resources
- SEQanswers forum - many NGS sequencing questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
- UCSC Genome Browser - visualize and download NGS data (see more below)
- Galaxy website for online sequencing data analysis
- Broad Institute Integrative Genomics Viewer (IGV)
- especially good for visualizing BAM file details
- Introduction to Sequence analysis in the Amazon EC2 cloud
- where you can "rent" Linux machines (useful if you don't have access to TACC)
Sequencing Technologies
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer single molecule sequencing
- Single cell sequencing
- Older technologies (less common now)
Life Technologies SOLiD (short reads in "colorspace")
Roche/454 – longer (up to ~1 Kb) reads often used in assemblies
FASTQ analysis/manipulation/QC
- Wikipedia FASTQ format page
- Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
- FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces nice quality report for FASTQ files
- MultiQC – http://multiqc.info/
- A great tool for consolidating multiple QC reports into one HTML page
- Anna's Byte Club tutorial on using MultiQC – https://utexas.atlassian.net/wiki/display/bioiteam/Using+MultiQC
- Available on ls5 at
/work/projects/BioITeam/ls5/opt/multiqc-1.0/multiqc
- also needs this $PYTHONPATH modification:
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/annab-packages:$PYTHONPATH"
- cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Available on ls5 at
/work/projects/BioITeam/ls5/opt/cutadapt-1.10/cutadapt
- also needs this $PYTHONPATH modification:
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/site-packages:$PYTHONPATH"
- Script that handles the details of paired-end read trimming
/work/projects/BioITeam/common/script/trim_adapters.sh
- trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets.
- fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
- Good for hard clipping. Available at TACC.
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
- seqtk – https://github.com/lh3/seqtk
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
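To make the FASTQ layout concrete (four lines per read: header, sequence, "+" separator, quality string), here is a minimal sketch in plain awk. It is an illustration only, not a substitute for the tools listed above; the function names count_fastq_reads and hard_clip_fastq are made up for this example, and it assumes an uncompressed, well-formed FASTQ file with no line-wrapped sequences.

```shell
# Count reads in an uncompressed FASTQ file (assumes exactly 4 lines per read).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hard-clip every read to its first N bases.
# Line 2 of each record is the sequence and line 4 is the quality string;
# both are truncated so they stay the same length.
# Usage: hard_clip_fastq N reads.fastq > clipped.fastq
hard_clip_fastq() {
    awk -v n="$1" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, n); next }
        { print }
    ' "$2"
}

# Example (hypothetical file name):
#   count_fastq_reads sample.fastq
#   hard_clip_fastq 50 sample.fastq > sample.clip50.fastq
```

For gzipped files, one option is to decompress to a stream first (for example with zcat) and pass /dev/stdin as the file argument.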
Reference genomes
- Gencode – https://www.gencodegenes.org/
- reference genomes, transcriptomes and high-quality annotations for human and mouse
- https://www.gencodegenes.org/releases/current.html
- UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- Ensembl downloads – ftp://ftp.ensembl.org/pub/
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- NCBI
- RefSeq – https://www.ncbi.nlm.nih.gov/refseq/
- well curated genome, transcriptome sequences
- GenBank – https://www.ncbi.nlm.nih.gov/genbank/
- public repository for sequence data, especially for prokaryotic genomes
- not curated
- Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
- excellent introduction to the types of genome references and the vocabulary used to describe them
- aimed at higher eukaryotes but vocabulary useful nonetheless
- GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
- Support for mapping to ALT contigs containing variants
- bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md
Basic alignment and aligners
- Comparison of different aligners
- by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
- File formats
- input: FASTQ format
- output: the SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
- fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
- fast, sensitive and extremely configurable
- The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
- email or come talk to Anna if you have questions or problems
Transcriptome-aware aligners
- HISAT2 – https://ccb.jhu.edu/software/hisat2/index.shtml
- new and fast, with support for alignment to a single genome or to a "population" of genomes
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNA-seq aligner
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- kallisto - https://pachterlab.github.io/kallisto/about
- ultra-fast RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
Alignment analysis
- SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
- type in a decimal number to see which flags are set
- samtools – by Heng Li
- SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
- older 0.1.xx versions: http://samtools.sourceforge.net/
- newer 1.3+ versions: http://www.htslib.org/
- Picard toolkit – http://broadinstitute.github.io/picard/
- SAM/BAM utilities that are read-group aware
- especially MarkDuplicates for flagging duplicate alignments
- bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common BED, BAM, VCF, GFF/GTF file manipulation.
- See BEDTools Overview for some common use cases.
- Available in the TACC module system
- SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for SAM/BAM files.
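The SAM flag translation mentioned above (enter a decimal number, see which flags are set) can also be sketched in a few lines of shell. This is a minimal illustration, not part of samtools or Picard: the function name explain_sam_flag and the short labels are made up for this example, while the bit values follow the FLAG field definitions in the SAM specification.

```shell
# Print the names of the bits set in a decimal SAM FLAG value.
# Labels are listed in bit order: 0x1, 0x2, 0x4, ... 0x800.
explain_sam_flag() {
    local flag="$1"
    local names="paired proper_pair unmapped mate_unmapped reverse mate_reverse first_in_pair second_in_pair secondary qc_fail duplicate supplementary"
    local bit=1
    local out=""
    local name
    for name in $names; do
        # Test the current bit, then move on to the next power of two.
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "${out:-none}"
}

# Examples:
#   explain_sam_flag 99     # paired,proper_pair,mate_reverse,first_in_pair
#   explain_sam_flag 147    # paired,proper_pair,reverse,second_in_pair
#   explain_sam_flag 4      # unmapped
```

Flags 99 and 147 are the values typically seen on the two mates of a properly paired alignment, which makes them handy sanity checks.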
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
- UCSC file format conversion scripts - useful for getting to/from WIG and BED to corresponding binary formats.
- Make sure you download the correct scripts for your operating system!
- Directories containing these tools can be found at TACC:
/work/projects/BioITeam/common/opt/UCSC_utils.2019_09
/work/projects/BioITeam/common/opt/UCSC_utils.2017_07
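As an example of the kind of format conversion discussed in this section, the sketch below turns GFF/GTF features into 6-column BED using awk. The key detail is the coordinate shift: GFF/GTF start positions are 1-based and inclusive, while BED start positions are 0-based. The function name gff_to_bed6 is made up for this illustration, and using the feature type as the BED name and 0 for a missing score are simplifying assumptions; for real work the bedtools or UCSC utilities listed above are the better choice.

```shell
# Convert tab-delimited GFF/GTF records to BED6.
# GFF/GTF columns: seqname source feature start end score strand frame attributes
# BED6 columns:    chrom start end name score strand
gff_to_bed6() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 8 { next }   # skip comment lines and malformed records
         {
             # Subtract 1 from the start only; the BED end is exclusive,
             # so the GFF end coordinate carries over unchanged.
             print $1, $4 - 1, $5, $3, ($6 == "." ? 0 : $6), $7
         }' "$1"
}

# Example (hypothetical file names):
#   gff_to_bed6 annotations.gff > annotations.bed
```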
UCSC Genome Browser
- Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
- 2020 Visualize mapped data at UCSC genome browser on this wiki
RNAseq/Transcriptome analysis
- Gene quantification from BAM/BED file reads
- featureCounts (part of the Subread package) – http://subread.sourceforge.net/
- HTSeq – https://htseq.readthedocs.io/en/master/
- The Tuxedo pipeline: RNAseq with tophat/cufflinks
- one of the first tool suites for transcriptome-aware RNA-seq alignment and quantification
- RNAseq analysis protocol article in Nature Protocols
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- resource bundles for selected organisms (GFF annotations, pre-built bowtie2 references, etc.)
- cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
- transcript quantification, normalization, differential expression
- HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
- from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- General RNA-seq analysis workflow from Bioconductor:
- DESeq2 – R Bioconductor package
- DESeq (version 1) documentation:
- https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
- while DESeq2 is more sophisticated, reading the original documentation is a better introduction to concepts
- DESeq2 documentation:
- kallisto – https://pachterlab.github.io/kallisto/
- RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
- blindingly fast – but aligns only to a transcriptome, not a genome
- companion quantification tool is sleuth – http://pachterlab.github.io/sleuth/about
- overview presentation – 2015-10-21-Kallisto.Anna.pdf
- Dhivya Arasappan's Introduction to RNA Seq CCBB summer school course
Variant calling
- Broad Institute GATK (Genome Analysis Toolkit) – https://software.broadinstitute.org/gatk/documentation/
- complex but powerful
- used by TCGA, 1000 Genomes
- File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
- The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
- Dan Deatherage's Genome Variant Analysis CCBB summer school course
Genome Annotation
- GO – http://geneontology.org/
- The Gene Ontology resource, a large source of information on the functions of genes
- GOrilla – http://cbl-gorilla.cs.technion.ac.il/
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool
- DAVID – https://david.ncifcrf.gov/
- functional annotation from user-supplied gene lists
- GREAT – http://bejerano.stanford.edu/great/public/html/splash.php
- Genomic Regions Enrichment of Annotations Tool
- takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- human, mouse, zebrafish
- MEME-suite – http://meme-suite.org/
- a motif identification and discovery tool. Works with most species.
- takes FASTA files as input
- filter your BAM/BED files to get the regions of interest
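- then extract FASTA sequences for those regions using bedtools getfasta (BAM files can first be converted to BED with bedtools bamtobed; bedtools bamtofastq produces FASTQ, not FASTA).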