samtools depth vs Rsamtools pileup
0
0
Entering edit mode
2.9 years ago
chasem ▴ 50

I am calculating coverage over the CDS from a bam file using samtools depth and Rsamtools pileup. I expected the result to be the same, or at least very similar, but that hasn't quite been the case.

With samtools, I am counting bases covered by at least one read like so:

grep $gene_id $gff | grep CDS | gff2bed | \
samtools depth -aa -Q 10 -b - $bam_path | \
cut -f3 | grep -v 0 | wc -l

The output, prior to the line starting cut... looks like this:

❯ head ~/Desktop/tmp/coverage_test_out.txt                                                                                     
CP022326.1  650760  5
CP022326.1  650761  5
CP022326.1  650762  5
CP022326.1  650763  5
CP022326.1  650764  5
CP022326.1  650765  5
CP022326.1  650766  5
CP022326.1  650767  5
CP022326.1  650768  12
CP022326.1  650769  12
...

With Rsamtools::pileup, I am doing the following:

#'
#' @param bamfile_path path to a bam file 
#' @param gr a GRanges object with the range(s) of interest
#' @param strandedness one of c("stranded", "unstranded"). NOTE: forward only strand NOT currently configured
#' @param quality_threshold default 10L, equivalent to current HTSeq default
#'
#' @return number of covered bases in CDS
#'
calculateCoverage_test = function(bamfile_path, gr, strandedness, quality_threshold=10L,
                             coverage_threshold=1, lts_align_expr_prefix=Sys.getenv("LTS_ALIGN_EXPR_PREFIX"),
                             bamfile_suffix='"_sorted_aligned_reads_with_annote.bam"'){

  bamfile_index = paste0(bamfile_path, ".bai")

  p_param <- PileupParam(min_mapq = quality_threshold,
                         min_nucleotide_depth = coverage_threshold,
                         distinguish_strands=TRUE,
                         distinguish_nucleotides=FALSE)

  # create the scanBamParam object.
  gene_strand = as.character(unique(data.frame(gr)$strand))
  minus_strand_flag = NA
  if (strandedness=='reverse' & gene_strand == '+'){
    minus_strand_flag = TRUE
  } else if (strandedness=='reverse' & gene_strand == '-'){
    minus_strand_flag = FALSE
  }

  sbp = ScanBamParam(which=gr,
                     flag=scanBamFlag(isMinusStrand=minus_strand_flag,
                                      isSecondaryAlignment=FALSE,
                                      isNotPassingQualityControls=FALSE,
                                      isSupplementaryAlignment=FALSE))

  coverage_df = pileup(bamfile_path,
                       index = bamfile_index,
                       scanBamParam = sbp,
                       pileupParam = p_param)

# take only distinct positions, and create a boolean vector 
# based on whether the count for a given position is > 0. 
# Sum over the boolean vector
  sum(coverage_out %>% 
            distinct(pos, .keep_all=TRUE) %>% 
            pull(count) > 0)

coverage_df, before the sum() cmd, looks like this:

> head(coverage_out)
    seqnames    pos strand count              which_label
1 CP022326.1 650760      -     5 CP022326.1:650760-650894
2 CP022326.1 650761      -     5 CP022326.1:650760-650894
3 CP022326.1 650762      -     5 CP022326.1:650760-650894
4 CP022326.1 650763      -     5 CP022326.1:650760-650894
5 CP022326.1 650764      -     5 CP022326.1:650760-650894
6 CP022326.1 650765      -     5 CP022326.1:650760-650894

In both cases, I am then dividing by the length of the gene. Thankfully, the gene length does come out the same from both tools.

However, the RSamtools method returns 167 more covered bases than the samtools depth command. That is a difference between 98% covered and 89% covered, respectively. In looking at the browser, it is hard to tell -- this gene is supposed to be knocked out, and obviously is not. The difference between 89% and 98% covered is pretty close, to my eye. That said, I'd like to know what I'm doing that is causing the difference.

If anyone has any suggestions or ideas, I would appreciate it.

Samtools RSamtools Bioconductor R • 566 views
ADD COMMENT
0
Entering edit mode

oh my...in looking at this again, I just realized what a stupid mistake that was. I am going to leave this up, just in case someone else is this much of a bonehead.

Clearly, grep -v 0 removes anything with a zero in it...which means that something with '30' counts gets removed.

ADD REPLY

Login before adding your answer.

Traffic: 1880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6