Question: Downsample a Bam file (Picard DownsampleSam)
0
gravatar for ra
8 months ago by
ra0
ra0 wrote:

Hi all,

I have been using Picard's DownsampleSam to extract random alignments from a .bam file.

The tool documentation for DownsampleSam states that

Reads marked as not primary alignments are all discarded

However, when running DownsampleSam (Picard release 2.4.1) on my .bam file, these 'not primary alignments' are not discarded.
For example, running samtools flagstat on my .bam file indicates I have a total of 66427440 reads, of which 22101618 are secondary alignments. Therefore, DownsampleSam should only take 44325822 reads forward for downsampling, but it is downsampling from all 66427440 reads ("Kept 39866332 out of 66427440 reads (60.01%)"). Below is the command line for the above:

DownsampleSam I=input.bam O=downsampled.bam RANDOM_SEED=2 P=0.6

I then ran the above command again, but using Picard release 1.130. This time, DownsampleSam did indeed discard the secondary alignments as expected (it is now downsampling from 44325822 reads: "Finished! Kept 35458630 out of 44325822 reads"). The reason why I am not using this release of Picard is because it does not support the STRATEGY option, which I will need for downsampling to smaller fractions.

I was wondering if anyone has come across this kind of behaviour before? I find it strange that a more recent version is not performing as it should. Or am I missing something obvious? I would be very grateful for any suggestions on why DownsampleSam is behaving differently between different versions. Thanks in advance!

bam rna-seq downsample picard • 977 views
ADD COMMENTlink written 8 months ago by ra0

The documentation states:

Reads from the same template (e.g. read-pairs, secondary and supplementary reads) are all either kept or discarded as a unit, with the goal of retaining readsfrom PROBABILITY * input templates. The results will contain approximately PROBABILITY * input reads, however for very small PROBABILITIES this may not be the case.

[source: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_sam_DownsampleSam.php]

This implies that secondary reads may be kept under certain circumstances.

ADD REPLYlink modified 8 months ago • written 8 months ago by Kevin Blighe37k

Thank you for your reply and highlighting the above information from the document.

Following on from this, the help information (from picard DownsampleSam -h) for Picard release 1.130 states:

Randomly down-sample a SAM or BAM file to retain a random subset of the reads. Mate-pairs are either both kept or both discarded. Reads marked as not primary alignments are all discarded. Each read is given a probability P of being retained - results with the exact same input in the same order and with the same value for RANDOM_SEED will produce the same results. Version: 1.130

This does not go into as much detail as the newer release help information. It looks like picard release 1.130 doesn't treat the secondary reads in the same way as the newer release; as it discards all secondary reads, whereas the newer release either keeps or discards all as a unit. This is a naive question, but am I correct in thinking that I should accept that all secondary reads are kept in my particular .bam file when using the newer release?

Thanks for your help!

ADD REPLYlink modified 8 months ago • written 8 months ago by ra0

It's quite possible that the different versions treat these reads differently. The most up to date documentation contradicts itself, as you have noticed.

On the GitHub page, the 'updated' documentation states:

Reads in a mate-pair are either both kept or both discarded. Reads marked as not primary alignments are all discarded. Each read is given a probability P of being retained so that runs performed with the exact same input in the same order and with the same value for RANDOM_SEED will produce the same results.All reads for a template are kept or discarded as a unit, with the goal of retaining readsfrom PROBABILITY * input templates.

[source: http://broadinstitute.github.io/picard/command-line-overview.html#Overview]

So, it's difficult to know what exactly it's doing as the documentation itself is contradictory. You can possibly follow the code trail to see what exactly it's doing: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/sam/DownsampleSam.java

ADD REPLYlink written 8 months ago by Kevin Blighe37k

Thanks for getting back, I appreciate your time.

I completely agree with you that the documentation are not clear in stating exactly how DownsampleSam is handling secondary reads.

I had a look at the code trail as you kindly suggested. This may be due to my lack of understanding, but it looks like the code does not specifically indicate how the secondary reads are handled? The code at line 231 is:

log.info("Kept ", iterator.getAcceptedCount(), " out of ", iterator.getSeenCount(), " reads (", fmt.format(iterator.getAcceptedFraction()), ").");

But I think the above simply outputs how many reads have been returned (the downsampled reads) out of the number of reads seen (the reads taken forward for downsampling)? I can't find where the code defines exactly how the getSeenCount() is obtained. Could you please point me in the right direction in this case?

I also had a look at the code trail for the specific picard release I have been using (2.4.1), which should discard all reads marked as not primary alignments (similar to the 'updated' documentation). However, again I can't find this specifically in the code trail.

From what I can see, it seems that picard release 2.4.1 should discard all secondary alignments, but for my .bam file it does not.

I hope you can answer my queries above.

ADD REPLYlink written 8 months ago by ra0
1

I also looked through the code myself but could not find a definitive part where it states at which SAM flags Picard is looking for DownsampleSam when extracting reads. This may be a question better directed at the developers, in that case.

ADD REPLYlink written 8 months ago by Kevin Blighe37k

Absolutely, neither the documentation or the code are very clear in this regard.

Thanks for your time Kevin, your inputs have been very helpful!!

ADD REPLYlink written 8 months ago by ra0
1

No problem. For the record, DownsampleSam is what I use for step 4 of the pipeline that I mention here: A: Best tool for variant calling

I had never considered the issues that you have found with this function, though.

ADD REPLYlink written 8 months ago by Kevin Blighe37k

That's great Kevin, thanks for the link above! It's good to know that you also use DownsampleSam for downsampling, as opposed to the other tools I had considered.

From what I can tell, this issue I am having hasn't been mentioned before by anyone else. Out of curiosity, I downsampled a .bam file from a different study and again, the secondary reads are not being discarded. It would be good to confirm if anyone else has found this issue too, but as you have suggested, it appears this is an issue in the function itself...

ADD REPLYlink written 8 months ago by ra0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1399 users visited in the last hour