Rsamtools::pileup gives inconsistent number of reads
1
2
Entering edit mode
6.9 years ago

I have created a fasta file with 300 simulated reads at a given SNP position (reads are 75mer, and I simulate every position in the + and - strand, hence 75x4=300 reads). Then, I map them back to the ref genome using bowtie and then use pileup to see the alignment at the position in question:

bf <- BamFile(bamfile)
param <- ScanBamParam(which=GRanges("chr10",IRanges(start=100189138, end=100189138)))
     

quickBamFlagSummary shows that there are 300 reads

> quickBamFlagSummary(bf, param=param)
                                group |    nb of |    nb of | mean / max
                                   of |  records |   unique | records per
                              records | in group |   QNAMEs | unique QNAME
All records........................ A |      300 |      300 |    1 / 1
  o template has single segment.... S |      300 |      300 |    1 / 1
  o template has multiple segments. M |        0 |        0 |   NA / NA
      - first segment.............. F |        0 |        0 |   NA / NA
      - last segment............... L |        0 |        0 |   NA / NA
      - other segment.............. O |        0 |        0 |   NA / NA

Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
Indentation reflects this.

Details for group S:
  o record is mapped.............. S1 |      300 |      300 |    1 / 1
      - primary alignment......... S2 |      300 |      300 |    1 / 1
      - secondary alignment....... S3 |        0 |        0 |   NA / NA
  o record is unmapped............ S4 |        0 |        0 |   NA / NA

Pileup only shows 262 reads (63+62+75+62)... Any idea why? I dont use any threshold on min_base_quality and min_mapq...

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138

I looked the region up with IGV, and also, IGV reports 300 reads.

I actually simulate several different positions, and the 262 reads (instead of 300) appear every time.. Not sure why. I am sure pileup is filtering the reads somehow..

Also, the same exact script works fine if I simulate 36mer reads (instead of 75...)! Any ideas why?

rsamtools pileup • 2.6k views
ADD COMMENT
1
Entering edit mode

Could you post a BAM file just containing this region somewhere for debugging? Also, what's the output of sessionInfo()?

ADD REPLY
1
Entering edit mode

Actually, note that pileup() takes not only a scanBamParam option, but also a pileupParam option. Given that the latter defaults to some filtering, I wouldn't be surprised if that's the source of the problem.

ADD REPLY
0
Entering edit mode

Thank you! it is solved by changing max_depth parameter within PileupParam()

> pileup(bf, scanBamParam=param, pileupParam=PileupParam(max_depth=8000, min_mapq=0, min_base_quality=0))
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    75 chr10:100189138-100189138
2    chr10 100189138      -          A    75 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    75 chr10:100189138-100189138

I was adding my arguments without PileupParam constructor, this is wrong:

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0, max_depth=8000)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138
ADD REPLY
0
Entering edit mode

this is what samtools pileup command (from samtools) gives:

samtools mpileup -B -sf $genome.fa -r chr10:100189138-100189138 test_aln75.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr10    100189138    A    300    G$.$g$,$G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,G.g,^~G^~.^~g^~,    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

counting in python shows 150 reads in each allele - the reference ("A") and alternative ("G")  (75 reads in each strand):

>>> a.count("G")
75
>>> a.count("g")
75
>>> a.count(".")
75
>>> a.count(",")
75
ADD REPLY
1
Entering edit mode
6.9 years ago

Thank you! it is solved by changing max_depth parameter within PileupParam()

> pileup(bf, scanBamParam=param, pileupParam=PileupParam(max_depth=8000, min_mapq=0, min_base_quality=0))
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    75 chr10:100189138-100189138
2    chr10 100189138      -          A    75 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    75 chr10:100189138-100189138

I was adding my arguments without PileupParam constructor, this is wrong:

> pileup(bf, scanBamParam=param, min_base_quality=0,min_mapq=0, max_depth=8000)
  seqnames       pos strand nucleotide count               which_label
1    chr10 100189138      +          A    63 chr10:100189138-100189138
2    chr10 100189138      -          A    62 chr10:100189138-100189138
3    chr10 100189138      +          G    75 chr10:100189138-100189138
4    chr10 100189138      -          G    62 chr10:100189138-100189138
ADD COMMENT

Login before adding your answer.

Traffic: 3026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6