A sampling of resources relevant to this course, not a comprehensive catalog.
Linux/TACC
- Linux fundamentals on this wiki
- Wikis for the 3 CBRS Unix/Linux workshops:
Online tutorials:
- Ryan's Linux Tutorial: http://ryanstutorials.net/linuxtutorial/
- Unix and Perl bootcamp for biologists: http://korflab.ucdavis.edu/unix_and_Perl/
- Unix primer (longer version) for biologists
Community Resources
- UCSC Genome Browser - visualize and download NGS data (see more below)
- Broad Institute Integrative Genomics Viewer (IGV)
- especially good for visualizing BAM file details
- Michigan State University ANGUS resources
- A list of their tutorials: http://ged.msu.edu/angus/
- 2012 Next-Gen Sequence Analysis Workshop - a similar tutorial to our course
- Introduction to sequence analysis in the Amazon EC2 cloud
- where you can "rent" Linux machines (useful if you don't have access to TACC or BRCF pods)
- Galaxy website for online sequencing data analysis
- SEQanswers forum - many NGS sequencing questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
Sequencing Technologies
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer single molecule sequencing
- Single cell sequencing
- 10x Genomics platforms
- Older technologies (not common now)
- Life Technologies SOLiD (short reads in "colorspace")
- Roche/454 – long (multi-Kb) reads often used in assemblies
...
FASTQ analysis/manipulation/QC
- Wikipedia FASTQ format page
- Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
- FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces a nice quality report for FASTQ files
- MultiQC – http://multiqc.info/
- A great tool for consolidating multiple QC reports into one HTML page
- Anna's Byte Club tutorial on using MultiQC – https://utexas.atlassian.net/wiki/display/bioiteam/Using+MultiQC
- cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Available at TACC at
/work/projects/BioITeam/ls5/opt/cutadapt-1.10/cutadapt
- also needs this $PYTHONPATH modification:
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/site-packages:$PYTHONPATH"
- Script that handles the details of paired-end read trimming
/work2/projects/BioITeam/common/script/trim_adapters.sh
- trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets. I haven't used it but it seems to be popular.
- fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
- Good for hard clipping and FASTA file manipulations. Available at TACC.
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
- seqtk – https://github.com/lh3/seqtk
- Suite of command line tools for FASTQ and FASTA analysis and manipulation
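The tools in this section all operate on FASTQ records, which are four lines each: a header starting with `@`, the sequence, a `+` separator line, and a quality string with one character per base. As a rough illustration of what a QC tool reads, here is a minimal Python sketch that parses records and decodes base qualities. It assumes Phred+33 quality encoding and unwrapped four-line records; the function names (`read_fastq`, `phred_scores`, `mean_quality`) are made up for this example and do not come from any of the tools listed above.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from an unwrapped 4-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header.strip())
        yield header[1:].strip(), seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual: str, offset: int = 33) -> float:
    """Average Phred score of a read; 0.0 for an empty quality string."""
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, the quality string `II5!` decodes to the scores 40, 40, 20 and 0. For real datasets, FastQC or seqtk will be far faster and more thorough than a sketch like this.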
Reference genomes
- Gencode – https://www.gencodegenes.org/
- reference genomes, transcriptomes and high-quality annotations for human and mouse
- https://www.gencodegenes.org/releases/current.html
- UCSC downloads – http://hgdownload.cse.ucsc.edu/downloads.html
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- Ensembl downloads – http://ftp.ensembl.org/pub/
- reference genomes, transcriptomes and high-quality annotations for many eukaryotes
- NCBI GenBank Nucleotide collection
- RefSeq – https://www.ncbi.nlm.nih.gov/refseq/
- well curated genome, transcriptome sequences
- GenBank – https://www.ncbi.nlm.nih.gov/genbank/
- public repository for sequence data, especially for prokaryotic genomes
- not curated
- Reference genome vocabulary – https://software.broadinstitute.org/gatk/documentation/article?id=7857
- excellent introduction to the types of genome references and the vocabulary used to describe them
- aimed at higher eukaryotes but vocabulary useful nonetheless
- GATK blog describing ALT contigs in GRCh38 – https://software.broadinstitute.org/gatk/blog?id=8180
- Support for mapping to ALT contigs containing variants
- bwa mem + bwakit by Heng Li – https://github.com/lh3/bwa/blob/master/README-alt.md
Basic alignment and aligners
- Comparison of different aligners
- by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
- File formats
- input: FASTQ format
- output: SAM (Sequence Alignment Map) format specification
- SAM1.pdf – header fields, body fields, flag definitions
- https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf – tag fields
- Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
- fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
- fast, sensitive and extremely configurable
- The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
- kallisto pseudo-alignment to annotated transcripts
/work/projects/BioITeam/common/script/run_kallisto.sh
- also available on many BRCF pods under /mnt/bioi/script.
- many pre-built references also available in /mnt/bioi/ref_genome
- email or come talk to Anna if you have questions or problems
Transcriptome-aware aligners
- HISAT2 – https://daehwankimlab.github.io/hisat2/
- new and fast, with support for alignment to a single genome or a "population" of genomes
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- STAR (Spliced Transcripts Alignment to a Reference) – ultra-fast RNA-seq aligner
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- kallisto - https://pachterlab.github.io/kallisto/about
- ultra-fast RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
...
- SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
- type in a decimal number to see which flags are set
- samtools – by Heng Li
- SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
- older 0.1.xx versions: http://samtools.sourceforge.net/
- newer 1.3+ versions: http://www.htslib.org/
- Picard toolkit – http://broadinstitute.github.io/picard/
- SAM/BAM utilities that are read-group aware
- especially MarkDuplicates and MarkDuplicatesWithMateCigar for flagging duplicate alignments
- SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for SAM/BAM files.
- bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common BED, BAM, VCF, GFF/GTF file manipulations such as:
- intersecting BAM or BED with annotation files
- bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)
- merging overlapping regions
- bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)
- generation of per-base genome-wide signal in bedGraph format
- bedtools coverage (http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)
- bedtools multicov (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html)
- extracting FASTA corresponding to regions
- bedtools getfasta (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)
- format conversion
- bedtools bamtobed (http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html)
- bedtools bamtofastq (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html)
- bedtools bedtobam (http://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html)
- See BEDTools Overview for some common use cases.
- Available in the TACC module system
- RSeQC – http://rseqc.sourceforge.net/
- RNA-SeQC (Broad Institute)
- RNA-QC-Chain – http://bioinfo.single-cell.cn/rna-qc-chain.html
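The SAM flag calculator mentioned above takes a decimal number and reports which flag bits are set. The same decoding can be sketched in a few lines of Python. The bit values below follow the SAM format specification; the short descriptions are paraphrased, and the function name `explain_flag` is just an illustrative choice, not part of samtools or Picard.

```python
from typing import List

# (bit value, short description) pairs for the 12 flag bits in the SAM spec
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag: int) -> List[str]:
    """Return the descriptions of all bits set in a SAM FLAG value."""
    if not 0 <= flag < 0x1000:
        raise ValueError("SAM flags are 12-bit values (0-4095)")
    return [name for bit, name in SAM_FLAGS if flag & bit]
```

For instance, `explain_flag(99)` reports a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair (99 = 1 + 2 + 32 + 64).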
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
- HTS format specifications – http://samtools.github.io/hts-specs/
- clearinghouse page for a number of NGS formats (SAM, CRAM, VCF, BCF, etc.)
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
- BioITeam script for converting GTF/GFF3 files to BED format
/work/projects/BioITeam/common/script/gtf_to_bed.pl
- UCSC file format conversion scripts - useful for converting WIG and BED to and from their corresponding binary formats.
- Make sure you download the correct scripts for your operating system!
- Directories containing these tools can be found at TACC:
/work/projects/BioITeam/common/opt/UCSC_utils.2013_03
/work/projects/BioITeam/common/opt/UCSC_utils.2017_07
- Mason program for simulating NGS sequencing reads
...
- Also available as a BioContainers module
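The main pitfall when converting between the annotation formats above is the coordinate system: GTF/GFF3 positions are 1-based and inclusive, while BED positions are 0-based and half-open. The Python sketch below shows that adjustment for one GTF line. It is not how the BioITeam gtf_to_bed.pl script works (that script's behavior is not described here); the function name, the choice of `gene_id` as the BED name, and the handling of the score column are all assumptions made for illustration.

```python
import re
from typing import Optional

# GTF attribute column looks like: gene_id "X"; transcript_id "Y"; ...
_GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_line_to_bed6(line: str) -> Optional[str]:
    """Convert one GTF feature line to a 6-column BED line.

    Returns None for blank lines and '#' comment lines.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns, got %d" % len(fields))
    chrom, _source, feature, start, end, score, strand, _frame, attrs = fields
    match = _GENE_ID.search(attrs)
    name = match.group(1) if match else feature  # fall back to the feature type
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based
    bed_end = int(end)          # inclusive end equals the half-open end
    bed_score = "0" if score == "." else score  # assumption: pass other scores through
    return "\t".join([chrom, str(bed_start), str(bed_end), name, bed_score, strand])
```

With this convention a GTF exon spanning 11869 to 12227 becomes the BED interval 11868 to 12227, and its length is simply end minus start.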
UCSC Genome Browser
- Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
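Since BED intervals come up throughout this page, here is a minimal Python sketch of one common operation on them: merging overlapping regions, the job that bedtools merge performs. This is only an illustration of the idea, not bedtools' implementation. It assumes 0-based half-open intervals held in memory as `(chrom, start, end)` tuples, merges intervals that overlap or touch end-to-start, and sorts chromosome names as plain strings (so `chr10` sorts before `chr2`).

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or book-ended intervals on each chromosome."""
    merged: List[Interval] = []
    # Sorting by (chrom, start, end) guarantees that any interval which can
    # merge with the previous one appears immediately after it.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

For genome-scale files, bedtools itself is the better choice, since it streams sorted input rather than holding everything in memory.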
RNAseq/Transcriptome analysis
- General RNA-seq Differential Gene Expression (DGE) analysis workflow from R's Bioconductor:
- Gene quantification from BAM/BED file reads
- featureCounts (part of the Subread package) – http://subread.sourceforge.net/
- HTSeq – https://htseq.readthedocs.io/en/master/
- HISAT2, StringTie, Ballgown suite – https://ccb.jhu.edu/software/hisat2/index.shtml
- transcriptome-aware alignment & quantification from the Johns Hopkins group who brought you the Tuxedo pipeline – but much faster!
- paper: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
- DESeq2 – R Bioconductor package for DGE
- DESeq (version 1) documentation:
- https://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf
- while DESeq2 is more sophisticated, reading the original documentation is a better introduction to concepts
- DESeq2 documentation:
- kallisto – https://pachterlab.github.io/kallisto/
- RNA-seq pseudoaligner that goes straight from FASTQ to estimated transcript abundances
- blindingly fast – but only to transcriptome
- companion quantification tool is sleuth – http://pachterlab.github.io/sleuth/about
- overview presentation – 2015-10-21-Kallisto.Anna.pdf
- The Tuxedo pipeline: RNAseq with tophat/cufflinks
- one of the first tool suites for transcriptome-aware RNA-seq alignment and quantification
- rarely used now, as other tools are much faster & more accurate
- RNAseq analysis protocol article in Nature Protocols
- TopHat - http://ccb.jhu.edu/software/tophat/index.shtml
- exon-aware sequence alignment (uses bowtie2/bowtie)
- resource bundles for selected organisms (GFF annotations, pre-built bowtie2 references, etc.)
- cuffquant, cuffnorm, cufflinks – http://cole-trapnell-lab.github.io/cufflinks/manual/
- transcript quantification, normalization, differential expression
- Dhivya Arasappan's Introduction to RNA Seq CCBB CBRS 2021 summer school course
Variant calling
- Broad Institute GATK (Genome Analysis Toolkit) – https://software.broadinstitute.org/gatk/documentation/
- complex but powerful
- used by TCGA (The Cancer Genome Atlas), 1000 Genomes
- File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
- The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
- Dan Deatherage's Genome Variant Analysis CCBB CBRS 2021 summer school course
Genome Annotation
- GO – http://geneontology.org/
- The Gene Ontology resource, a large source of information on the functions of genes
- GOrilla – http://cbl-gorilla.cs.technion.ac.il/
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool
- GSEA – https://www.gsea-msigdb.org
- Gene Set Enrichment Analysis
- DAVID – https://david.ncifcrf.gov/
- Functional annotation from user-supplied gene lists
- GREAT – http://bejerano.stanford.edu/great/public/html/splash.php
- Genomic Regions Enrichment of Annotations Tool
- Takes BED files as input and outputs enriched genes, GO-terms, motifs, etc.
- human, mouse, zebrafish
- MEME-suite – http://meme-suite.org/
- A motif identification and discovery tool. Works with most species.
- Takes FASTA files as input
- filter your BAM/BED files to get the regions of interest
- then convert to FASTA using bedtools bamtofastq.
...