...
Code Block |
---|
language | bash |
---|
title | Start an idev session |
---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Fri # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004 |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
...
Expand |
---|
|
Code Block |
---|
| module load samtools
cd $SCRATCH/core_ngs/bedtools_multicov
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
Code Block |
---|
title | samtools flagstat output |
---|
| 3347559 + 0 in total (QC-passed reads + QC-failed reads)
24317 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3347559 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities: Code Block |
---|
| samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c |
Code Block |
---|
title | distribution of mapping qualities |
---|
| 498 20
6504 21
1012 22
355 23
1054 24
2800 25
495 26
14133 27
282 28
358 29
954 30
1244 31
358 32
6143 33
256 34
265 35
1112 36
905 37
309 38
4845 39
5706 40
427 41
1946 42
1552 43
1771 44
6140 45
1771 46
3049 47
3881 48
3264 49
4475 50
15692 51
25378 52
16659 53
18305 54
7108 55
2705 56
59867 57
2884 58
2392 59
3118705 60 |
To compute average mapping quality: Code Block |
---|
| samtools view yeast_mrna.sort.filt.bam | awk '
BEGIN{FS="\t"; sum=0; tot=0}
{sum = sum + $5; tot = tot + 1}
END{printf("mapping quality average: %.1f for %d reads\n", sum/tot,tot) }' |
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...
Expand |
---|
|
Code Block |
---|
| idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5 CoreNGS # or -A TRA23004
module load biocontainers
module load samtools
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
|
...
Expand |
---|
|
Code Block |
---|
| wc -l yeast_mrna_gene_counts.bed |
6485 records were written, one for each feature in the merged.sc_genes.bed file. The overlap count was added as the last field in each output record (. So here it is field 7 , since the input annotation file had 6 columns). |
Exercise: How many features have non-zero overlap counts?
...
A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).
- Go to the UCSC Genome Browser: https://genome.ucsc.edu/
- Select Genomes from the top menu bar
- Select Human from POPULAR SPECIES
- under Human Assembly select Feb 2009 (GrCh37/hg19)
- select GO
- In the hg19 browser page, the Layered H3K27Ac track is a signal track
- the x-axis is the genome position
- the y-axis represents the count of ChIP-seq reads that overlap each position
- where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)
...
Code Block |
---|
language | bash |
---|
title | Prepare for bedtools coverage |
---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5 # or -A TRA23004
# or
idev -m 90 -N 1 -A OTH21164 -p development # or -A TRA23004
module load biocontainers
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
cd $SCRATCH/core_ngs/bedtools_genomecov
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* . |
Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.
Code Block |
---|
|
cd $SCRATCH/core_ngs/bedtools_genomecov
bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph
wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines |
...
Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding corresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:
...
Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed(see Tips and tricks#perlpatternsubstitution):
Code Block |
---|
|
module load ucsc_tools
cd $SCRATCH/core_ngs/bedtools_genomecov
bedGraphToBigWig # look at its usage
# create the needed chromosome sizes file from our BAM header
module load samtools
samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt
cat sc_chrom_sizes.txt
# displays:
chrI 230218
chrII 813184
chrIII 316620
chrIV 1531933
chrV 576874
chrVI 270161
chrVII 1090940
chrVIII 562643
chrIX 439888
chrX 745751
chrXI 666816
chrXII 1078177
chrXIII 924431
chrXIV 784333
chrXV 1091291
chrXVI 948066
chrM 85779 |
...
Code Block |
---|
|
cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C # may or may not need this...
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw |
...