Question: Subsample BAM to fixed number of alignments
Daniel (Cardiff University) wrote:

I am attempting to run a comparison across a batch of alignments (BAM format) to assess spread. However, some of my datasets are larger than others, and I wish to subsample down to an equal number of reads per sample.

If I wanted to randomly extract 1 million reads from a BAM file, is there a method to do this?

Note: I am fully aware of the samtools and Picard methods which allow you to reduce by a proportion (e.g. the flag -s 0.33), but that would not give a fixed number of reads per sample, which is what I need; it gives a reduced proportion per sample, which doesn't help.

BAM subsampling has been discussed before, but always as proportional data reduction, not the fixed number I require:

Sample Sam File
A: Subsampling Bam File With Samtools
How to reduce reads in fastq, bam, bed files in a random manner?
https://broadinstitute.github.io/picard/command-line-overview.html

Edit: I've also come across bamtools, but I haven't been able to get it to work, and it seems not to have been updated in quite some time.

Tags: subsampling • bam

Is there a reason you cannot use bash, as several people have suggested in the posts you link to? For example, see C: Sample Sam File, which shows how to get 1 million randomly selected reads from your BAM file.

Reply by alolex

I was hoping a tool would already exist, so I wouldn't need to create 500 GB of temporary files only to compress them again.

Reply by Daniel

Well, it depends on the downstream application, but you can often get around writing intermediates to disk with redirection:

myprogram -f1 file1.bam -f2 <(samtools view file2.bam | head -1000000) -f3 file3.bam
Reply by george.ry

Yeah, I see. I just expected there to be something slick, as I imagine it's a common process. Thanks for the info.

Reply by Daniel

reformat.sh from BBMap can accept a sample= parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.

Reply by Devon Ryan

BTW, if reformat.sh won't do this, then I'm tempted to just write a small program that subsamples a fixed number of alignments (or pairs, for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
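For illustration, here is a minimal single-pass sketch of that approach (plain Algorithm R over a SAM stream, with a hypothetical script name; this is not Devon's program, and it treats records independently, so it will not keep pairs together):

#!/usr/bin/env python
# reservoir_sam.py (hypothetical): keep exactly N alignments from a SAM stream.
# Usage: samtools view -h in.bam | python reservoir_sam.py 1000000 | samtools view -bS - > sample.bam
import random
import sys

n = int(sys.argv[1])        # target number of alignments
reservoir = []
seen = 0                    # alignment records seen so far

for line in sys.stdin:
    if line.startswith('@'):        # pass header lines through untouched
        sys.stdout.write(line)
        continue
    seen += 1
    if len(reservoir) < n:
        reservoir.append(line)
    else:
        j = random.randrange(seen)  # uniform in [0, seen)
        if j < n:                   # i.e. replace with probability n/seen
            reservoir[j] = line

sys.stdout.writelines(reservoir)

One pass over the input with O(N) memory for the reservoir; paired-end support would need name-grouped input so mates could be sampled as a unit.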

Reply by Devon Ryan

I've tried out BBMap, as it purported to do what you said, but I got some Java errors, and given @matted's answer below I've just decided to stick with that, as I'm not Java-savvy. I tried all combinations of SAM and BAM for inputs and outputs too. In case it's of interest:

daniel-desktop:~/Raw_Data/ALD/SAMs$ ~/Local/bbmap/reformat.sh -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
java -ea -Xmx200m -cp /home/sbi6dap/Local/bbmap/current/ jgi.ReformatReads -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
Executing jgi.ReformatReads [-in, ES09.bam, -out, ES09_1m.sam, samplereadstarget=1000000, mappedonly=t]

Unknown parameter ES09_1m.sam
Exception in thread "main" java.lang.AssertionError: Unknown parameter ES09_1m.sam
    at jgi.ReformatReads.<init>(ReformatReads.java:168)
    at jgi.ReformatReads.main(ReformatReads.java:45)
Reply by Daniel

Perhaps this wouldn't yield the error:

~/Local/bbmap/reformat.sh in=ES09.bam out=ES09_1m.sam samplereadstarget=1000000 mappedonly=t
Reply by Devon Ryan

Wow, yes, I'm an idiot. I didn't read it carefully enough because I was trying to rush it. It looks like it's running now. At least it makes a good test of this versus the shuf method below, as they're both running simultaneously.

Rule 1: RTFM 

Reply by Daniel

I'd just like to add one comment...

This command will work fine, with two caveats. First, reformat.sh will keep paired reads together for FASTQ and FASTA input, but NOT for SAM/BAM input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t).

Reply by Brian Bushnell

Cool, I'd be interested in hearing the difference in time required.

Edit: Though really this should be I/O bound...

Reply by Devon Ryan

I think I'm going to have to walk away from it today, and unfortunately I didn't put a time count on it. I'll do a test tomorrow.

Reply by Daniel
Hi Daniel, do you want 1M reads or 1M alignments? When sampling the latter, you might need to control for the NH tag and the secondary-alignment flag. Otherwise, you could extract the unique read IDs, shuffle them, select 1M, and filter your BAM file with those IDs.
Reply by michael.ante
matted (Boston, United States) wrote:

This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines from a file. Recent versions of the tool use reservoir sampling, which improves efficiency. You could run something like:

shuf -n 1000000 input.sam > output.sam

To get the header correctly, you could do something like:

cat <(samtools view -SH input.sam) <(samtools view -S input.sam | shuf -n 1000000) > output.sam

Thanks, this seems a lot less onerous than the other options with intermediate files that I first looked at. Pipes save the day. I'm still surprised that there isn't a canonical tool for this, though.

For completeness, taking BAM inputs, I'm using:

cat <(samtools view -H input.bam) <(samtools view -q 255 input.bam | shuf -n 1000000) > output.sam
Reply by Daniel

My guess as to why there's no existing tool to do this (as far as we know) is that random subsampling gives answers that are almost correct in terms of the number of output reads, and it is easier to implement.

If you take the sampled number of reads as a binomial random variable, the standard deviation is tiny compared to the number of reads (at most 1,000 reads when sampling 1M, which is 0.1% of the total). In fact, the standard deviation as a fraction of the expected number of sampled reads N is at most 1/sqrt(N). The argument might be that 0.1% variation shouldn't cause problems, or if it does, you may already be in trouble.
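As a quick sanity check of those numbers (assuming, purely for illustration, a 100M-read BAM sampled down to an expected 1M):

import math

total = 100_000_000                   # hypothetical reads in the BAM
target = 1_000_000                    # expected sample size M
p = target / total
sd = math.sqrt(total * p * (1 - p))   # binomial standard deviation
print(round(sd), sd / target)         # ~995 reads, ~0.001 (about 0.1% of M)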

Reply by matted

reformat.sh will give an exact number of reads (or bases), but at the expense of reading the file twice.  For an approximate answer, it only has to read the file once, which is faster.
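For intuition, a sketch of that two-pass, exact-count strategy on a plain SAM text file (my illustration, not reformat.sh's actual implementation): pass one counts the records, pass two keeps a pre-chosen set of exactly m of them.

import random
import sys

path, m = sys.argv[1], int(sys.argv[2])

# Pass 1: count alignment records (skip header lines).
with open(path) as f:
    total = sum(1 for line in f if not line.startswith('@'))

# Choose exactly m distinct record indices up front (O(m) memory).
keep = set(random.sample(range(total), m))

# Pass 2: emit the header plus exactly the chosen records.
with open(path) as f:
    i = 0
    for line in f:
        if line.startswith('@'):
            sys.stdout.write(line)
            continue
        if i in keep:
            sys.stdout.write(line)
        i += 1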

Reply by Brian Bushnell
Alex Reynolds (Seattle, WA, USA) wrote:

One approach is to:

  • Convert BAM to SAM with samtools
  • Strip the header from the parent SAM file
  • Use sample to draw 1M lines without replacement from the parent, headerless SAM; sample uses reservoir sampling of line offsets to get around memory limitations in GNU shuf
  • Reheader the SAM-formatted sample
  • Recompress the reheadered, SAM-formatted sample to a BAM file

In explicit code:

$ samtools view -h -o reads.sam reads.bam
$ grep -v '^@' reads.sam > nohead_reads.sam
$ sample --sample-size 1000000 --sample-without-replacement --preserve-order nohead_reads.sam > nohead_reads_1M_sample.sam
$ samtools reheader nohead_reads_1M_sample.sam reads.bam
$ samtools view -bS nohead_reads_1M_sample.sam -o reads_1M_sample.bam

This process can be easily automated and parallelized.


Just been having a poke around the code for your sample program.  Good stuff, and quick too!  That's gone straight into my /u/l/bin!

Seeing as I did have the source, though, I did reformat the help text to fit 80 characters ;) You're toying with my sanity!

Reply by george.ry

The sample tool looks really cool, but I have a question about the samtools reheader nohead_reads_1M_sample.sam reads.bam step: nohead_reads_1M_sample.sam doesn't have a header, and samtools reheader is supposed to use the input SAM's header to modify the input BAM file. So this line doesn't make sense, right?

Reply by epigene

I have tried the suggestion above with a little modification. In my case I needed to sample 20M reads from each of hundreds of BAM files, with replacement. The sample tool is the best one for this.

samtools view -h -o reads.sam reads.bam # with the good header
sample -k 20000000 -r -p reads.sam > reads_20M.sam
samtools view -bS reads_20M.sam > reads_20M.bam
samtools reheader reads_20M.bam reads.sam | samtools sort -o reads_20M.sorted.bam

In the end I lost some reads, because the sampling also ran over the header (read-group) lines. But even with the resulting bad header, reads_20M.sam could still be converted to BAM format.

samtools view -c reads_20M.sorted.bam
19999810
Reply by zhaolin20042008
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

I've just written one: https://github.com/lindenb/jvarkit/wiki/Biostar145820

$ samtools view -bu -q 60 input.bam |\
    java -jar dist/biostar145820.jar -n 10 -o out.bam

It looks like tomorrow is going to be a day of testing bam subsampling methods! Thanks, I'll check it out.

Reply by Daniel

Given Brian's mention of the two-pass nature of reformat.sh for extracting exact numbers of reads, I suspect that this Java solution will be the fastest (excluding I/O bottlenecks).

Reply by Devon Ryan

I don't think that's right. Glancing through the code, Pierre's solution shuffles (sorts by random indices) the entire BAM (say size N), then takes the first M. So the runtime is O(N log N). The memory usage is trickier to analyze because the sorting will spill to disk, but it's at least O(M).

Alex's solution (and mine with shuf, assuming a new enough version of shuf) is O(N), with O(M) memory usage thanks to reservoir sampling. And the proportional sampling methods (with approximate total counts) are O(N) with O(1) memory usage.

Reply by matted

Indeed, I guess I hadn't looked at the code closely enough.

Reply by Devon Ryan

I didn't read the whole thread, but am I wrong in thinking that, when using reservoir sampling with a small number of reads, the output file tends towards containing the last reads in the BAM? So if a large BAM contains unmapped reads (at the end) and N=10, for example, will the output contain only unmapped reads?

Reply by Pierre Lindenbaum

With a small N and a large number of reads, the probability that rand(1,i) <= N for the i-th entry in the file becomes smaller with increasing i, so later reads are not favored: every record ends up in the output with the same probability, N divided by the total number of reads. Actually, this is true for any N.
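A quick simulation bears this out (illustrative only): reservoir-sample N=10 indices from a stream of 1,000, repeat a few thousand times, and the first and last deciles of the stream are kept equally often.

import random
from collections import Counter

def reservoir(stream_len, n):
    # Algorithm R: uniform sample of n indices from a stream of stream_len.
    res = []
    for i in range(stream_len):
        if len(res) < n:
            res.append(i)
        else:
            j = random.randrange(i + 1)
            if j < n:
                res[j] = i
    return res

counts = Counter()
for _ in range(5_000):
    counts.update(reservoir(1_000, 10))

first = sum(counts[i] for i in range(100))        # indices 0..99
last = sum(counts[i] for i in range(900, 1_000))  # indices 900..999
print(first, last)  # both ~5,000: no bias toward the end of the stream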

Reply by Devon Ryan