I am attempting to run a comparison a batch of alignments (bam format) to assess spread. However, some of my datasets are larger than others and I wish to subsample down to an equal number of reads per sample.
If I wanted to randomly extract 1 Million reads from a bam file is there a method to do this?
Note: I am fully aware of the samtools and picard methods which allow you to reduce by a proportion (i.e. the flag -s 0.33) but that would not result in an fixed number of reads per sample, which is what I need, but a reduced proportion per sample which doesn't help.
This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines in a file. Recent versions of the tool use reservoir sampling, which will improve efficiency. You could run something like:
shuf -n 1000000 input.sam > output.sam
To get the header correctly, you could do something like:
Is there a reason you cannot use bash like several have commented on in the posts you link to? For example, see Sample Sam File as it shows how to get 1 million randomly selected reads from your bam file.
I was hoping there would be a tool already existing without needing me to create 500gb of temporary files only to compress them again.
Well it depends on the downstream application, but you can often get around writing intermediates to disk by redirection ...
Yeah, I see. I just expected there to be something slick as I imagine it's a common process. Thanks for the info.
reformat.shfrom BBMap can accept a
sample=parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.
reformat.shwon't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
I've tried out BBmap as it purported to do as you said, but got some java errors and given @matted's answer below I've just decided to stick with that as I'm not Java savvy. I tried all combinations of sam and bam for ins and outs too. If of interest:
Perhaps this wouldn't yield the error:
Wow, yes, I'm an idiot. Didn't read it well enough as I was trying to rush it. It looks like it's running now. Good test of this versus the shuf method below at least, as they're both going simultaneously.
Rule 1: RTFM
I'd just like to add one comment...
This command will work fine, with 2 caveats. Firstly,
reformat.shwill keep paired reads together for fastq and fasta input, but NOT for sam/bam input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set
Cool, I'd be interested in hearing the difference in time required.
Edit: Though really this should be IO bound...
Think I'm going to have to walk away from it today and I didn't put a time count on it unfortunately. Will do a test tomorrow.