Question: salmon NumReads inconsistent with depth from --writeMappings?
0
gravatar for ningyuan.sg
16 days ago by
ningyuan.sg0 wrote:

I am trying to count gene expression for differential expression analysis. I used salmon to quantify genes against a reference transcriptome to get their TPM, but also --writeMappings to check their max read depth. I got:

       Name   EffectiveLength       TPM      NumReads   MaxDepth
NR_146118.1             4,925   177,186   31,949,345           6
NR_132964.1                24   124,223      109,585       111,527

MaxDepth for NR_132964.1 was amended from 8,038 to 111,527 after comments pointing out that samtools without argument -d 0 limits MaxDepths to about 8000.

MaxDepth is the maximum read depth from samtools view on the output of --writeMappings. The rest of the columns are from salmon's quant.sf output file.

I understand why there is not a linear relationship between NumReads and TPM. But why are their MaxDepths so different?


Some more data. For MaxDepth == NA, samtools depth found no depth there---which shouldn't be the case for these top few expressed genes.

   Name           Length EffectiveLength      TPM  NumReads MaxDepth
 1 U13369.1        42999         42847.      28.0    43870.     8034
 2 NM_001190452.1   1555          1403.   12043.    618711.     8028
 3 NM_001190470.1   1036           884.   13554.    438843.     8035
 4 NM_001190476.1   1231          1079.    1531.     60512.     8005
 5 NM_001190487.2   1395          1243.    1262.     57434.     8022
 6 NM_001190702.1   1290          1138.   20012.    833999.     8036
 7 NM_001190706.1    527           376.   13917.    191698.     4476
 8 NM_001190708.1   1121           969.    3952.    140262.     8024
 9 NR_003285.3       157            39.1 151275.    216553.       57
10 NR_003286.4      1869          1717.     777.     48851.     8000
11 NR_132964.1       128            24.1 124223.    109585      8038
12 NR_145819.1     13351         13199.      45.5    21968.     8001
13 NR_145822.1      5066          4914.     825.    148378.       NA
14 NR_146117.1     13373         13221.     222.    107270.     8016
15 NR_146118.1      5077          4925.  177186.  31949345.        6
16 NR_146119.1      1869          1717.  439340.  27622373.      122
17 NR_146120.1       156            38.5  23238.     32776.       NA
18 NR_146144.1     13315         13163.    1264.    608993.     8009
19 NR_146148.1      5054          4902.     649.    116444.       NA
20 NR_146154.1      5055          4903.    2997.    538036.       NA

It has been pointed out in the comments that many of the MaxDepths around the 8000s is due to the default depth limit of samtools view. Nonetheless, I'd still wonder why some of these transcripts with high TPM and NumReads don't show any depth.

ADD COMMENTlink modified 12 days ago • written 16 days ago by ningyuan.sg0

probably because gene have different expression level?

ADD REPLYlink written 14 days ago by Morris_Chair120

Sure, but wouldn't their maximum read depths be roughly proportional to their TPM?

ADD REPLYlink written 14 days ago by ningyuan.sg0

TPM and Depth are calculated using different algorithm for instance TPM as well as RPKM are counts based on the transcript length and read numbers for that gene. The lower TPM and the slightly higher depth might be due to a longer transcript. That's the explanation that I have, but I'm not the best bioinformatician around ;)

ADD REPLYlink written 14 days ago by Morris_Chair120

I understand your explanation, but I have some doubts as to the likelihood that this difference would cause such a counterintuitive and drastic difference in read depths. Thank nonetheless, though.

ADD REPLYlink written 14 days ago by ningyuan.sg0

I cannot answer your question but for differential expression you do not need TPM. Also you would not need to look into the files. The recommended next step would be to use the tximport package to summarize the transcript level quantifications to the gene level followed by differential analysis with e.g. DESeq2/edgeR/limma.

Still I will tag Rob (the developer of salmon). Please add the command you used to get this MaxDepth column.

ADD REPLYlink modified 12 days ago • written 12 days ago by ATpoint19k

Thank you. I obtained the MaxDepth column by using samtools depth <ALIGNMENT> -r <REGION>, where <alignment> is the indexed and sorted file output by salmon ... --writeMappings=<ALIGNMENT>. [1] Columns are combined with a simple R script [2].

In my next step, I use edgeR to import gene counts in quant.sf using the first and fourth columns (Name, TPM). That is, I use the R command

library(edgeR)
files <- c(...)
count_matrix <- readDGE(files, columns=c(1,4))

[1]

all: MAP SORT INDEX DEPTH

INDEX_DIR = /home/shared/GenomeReferences/hg38_refseq_transcripts_with_rdna

MAP: WT1.mappings.sam
WT1.mappings.sam: ../WILDTYPE1_CGATGTA_S42_L008_R1_001.fastq.gz \
        ../WILDTYPE1_CGATGTA_S42_L008_R2_001.fastq.gz
        salmon quant -i ../index -l A -1 $< -2 $(word 2,$^) --validateMappings -o $@ \
                --writeMappings=WT1.mappings.sam

SORT: WT1.sorted.bam
WT1.sorted.bam: WT1.mappings.sam
        samtools sort -o WT1.sorted.bam WT1.mappings.sam

INDEX: WT1.sorted.bam.bai
WT1.sorted.bam.bai: WT1.sorted.bam
        samtools index WT1.sorted.bam

REGIONS = NR_146118.1.depth NR_146119.1.depth NM_001190702.1.depth \
        NM_001190452.1.depth NR_146144.1.depth NR_146154.1.depth \
        NM_001190470.1.depth NR_003285.3.depth NM_001190706.1.depth \
        NR_145822.1.depth NM_001190708.1.depth NR_132964.1.depth \
        NR_146148.1.depth NR_146117.1.depth NR_003286.4.depth \
        NM_001190476.1.depth NM_001190487.2.depth U13369.1.depth \
        NR_146120.1.depth NR_145819.1.depth
DEPTH: $(REGIONS)
$(REGIONS): %.depth: %
        samtools depth WT1.sorted.bam -r $< > $@
$(patsubst %.depth,%,$(REGIONS)):

[2]

#! /usr/bin/env Rscript

suppressPackageStartupMessages(library(tidyverse))

depths <- suppressMessages(read_tsv('quant.sf'))
depths <- subset(depths, Name %in% c(
        'NM_001190452.1', 'NM_001190470.1', 'NM_001190476.1', 'NM_001190487.2',
        'NM_001190702.1', 'NM_001190706.1', 'NM_001190708.1',
        'NR_003285.3', 'NR_003286.4',
        'NR_132964.1',
        'NR_145819.1', 'NR_145822.1',
        'NR_146117.1', 'NR_146118.1', 'NR_146119.1', 'NR_146120.1',
        'NR_146144.1', 'NR_146148.1', 'NR_146154.1',
        'U13369.1'))

max.depths <- c()
for (name in depths$Name) {
    filename <- paste0(name, '.depth')
    depth.data <- suppressMessages(
            read_tsv(filename, col_names=c('name', 'pos', 'depth')))
    max.depths <- c(
            max.depths,
            suppressWarnings(max(depth.data$depth)))
}
depths$MaxDepth <- max.depths
depths$MaxDepth[depths$MaxDepth==-Inf] <- NA

depths
ADD REPLYlink written 12 days ago by ningyuan.sg0
  1. What not use tximport?
  2. What version of samtools are you using and are you changing the maximum possible depth option in samtools depth?
ADD REPLYlink written 12 days ago by Devon Ryan91k
  1. Frankly, I've just heard of tximport. Obviously, I should be using that, but... I think the apparent discrepancy between maximum read depth and TPM still hold, even if we are looking at a transcript rather than gene level.
  2. samtools 1.9, Using htslib 1.9. No, I do not pass any option to samtools depth besides specifying the region. See [1] above, or the Makefile extract:

        samtools depth WT1.sorted.bam -r $< > $@
    
ADD REPLYlink written 12 days ago by ningyuan.sg0
-d/-m <int>         maximum coverage depth [8000]. If 0, depth is set to the maximum
                       integer value, effectively removing any depth limit.

You should set -d to 0 when running samtools depth otherwise it is using 8000 and that may be leading to odd numbers you are getting.

You may also want to look at mosdepth as a fast alternative (https://github.com/brentp/mosdepth ).

ADD REPLYlink modified 12 days ago • written 12 days ago by genomax70k

What I think you're observing is a result of the default ~8000 depth in samtools depth and samtools mpileup. In essence, this is the size of a buffer for holding reads and once the depth goes much above this you're going to start seeing some odd behavior due to that (I don't recall the details of this edge case, I'd have to check the source code to see how it's handled).

ADD REPLYlink written 12 days ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2155 users visited in the last hour