Question: randomly shuffle BAM file
2
gravatar for mparker2
4.0 years ago by
mparker220
Sheffield
mparker220 wrote:

Is there a tool like bedtools shuffle which I can use to randomly shuffle a bam file? or will I have to convert my bam into a bed and then shuffle it? thanks

shuffle bam bedtools • 4.1k views
ADD COMMENTlink modified 2.2 years ago by skanterakis100 • written 4.0 years ago by mparker220

I would use `shuf` for shuffling alignment lines taking care about header: detaching it before and reattaching it after

ADD REPLYlink written 4.0 years ago by Pavel Senin1.9k
3

`shuf` loads everything in memory...

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum121k

good point, will `sort --random-sort` do the job? (but it'll be slow...)

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Pavel Senin1.9k

In case you ever have your bam as GRanges (GenomicRanges package) in R, you may use the sample(GRanges, size, replace=FALSE) command.

ADD REPLYlink written 2.2 years ago by ATpoint19k
2
gravatar for Pierre Lindenbaum
4.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

use https://github.com/lindenb/jvarkit/wiki/Biostar145820 to shuffle the reads with option `-n -1`

ADD COMMENTlink written 4.0 years ago by Pierre Lindenbaum121k

does this program actually move the positions of reads on the chromosome randomly, like I think bedtools shuffle does, or does it just just shuffle the line numbers of reads in the file? its not clear to me from the page

ADD REPLYlink written 4.0 years ago by mparker220

it shuffles the reads in the file.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Pierre Lindenbaum121k
1
gravatar for Len Trigg
4.0 years ago by
Len Trigg1.3k
New Zealand
Len Trigg1.3k wrote:

If your BAM file contains paired-end data, you may want to be careful that reads for each arm are in proximity to each other, and I believe the above methods will result in the arms ending up scattered.

For example, a common use case for "shuffling" a BAM is to take an already aligned BAM and shuffle it in order to eliminate biases you get when mapping a bam that is sorted in chromosome order. If this is your motivation for doing the shuffling, you would be best to use samtools bamshuf (which effectively sorts by a hash of the read name, so paired end reads are kept together, but avoiding a merge step that a proper sort-by-name requires).

 

ADD COMMENTlink written 4.0 years ago by Len Trigg1.3k
1
gravatar for skanterakis
2.2 years ago by
skanterakis100
United Kingdom
skanterakis100 wrote:

samtools bamshuf file.bam does the job. It'll also break up your bam in -n (default=64) chunks. Unfortunately though, it wants to shuffle the entire bam and then writes the chunks to disk (does not stream even with the -O option). If you just want to sample a few reads from a large bam quickly, you can "shuffle" some coordinates and then use samtools view file.bam chr:start-start | head -1 to get a read. If you want read pairs, you can either use samtools bamshuf or sort the bam by name(samtools sort -n).

ADD COMMENTlink written 2.2 years ago by skanterakis100

Just to point out samtools bamshuf is found only up to SAMtools 1.2, from SAMtools 1.3 on, the command is samtools collate. In addition, samtools collate does output to stdout (I didn't test with bamshuf but maybe works too), but it needs to create the intermediary files nonetheless.

samtools collate -u -O file.bam tmp | whatever

This works, but creates tmp.001.bam, tmp.002.bam`, and so on.

ADD REPLYlink written 21 months ago by h.mon26k
0
gravatar for Ian
4.0 years ago by
Ian5.5k
University of Manchester, UK
Ian5.5k wrote:

You could use the bedtools suite of programs.

bamtobed > shuffle > bedtobam

ADD COMMENTlink written 4.0 years ago by Ian5.5k
0
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

One option is to use bam2bed in conjunction with sample:

$ bam2bed < reads.bam > reads.bed
$ sample -k 1234 reads.bed > random_sample.bed

One advantage of sample over GNU shuf is that shuf loads everything into memory before shuffling, while sample uses reservoir sampling on line offsets so that the memory overhead is much, much lower. Memory usage can be an issue for very large input files.

ADD COMMENTlink written 4.0 years ago by Alex Reynolds28k
0
gravatar for plijnzaad
3.3 years ago by
plijnzaad30
European Union
plijnzaad30 wrote:

I tend to specify a fraction of whatever lines I want (which I have to know in advance), and then use rejection sampling (i.e., if (rand() > fraction) { print it }; ).This can use stdin and stdout, is exceedlingly fast, but maintains the original order. If that is undesirable, you can as yet subject the result to sample (which sounds like a useful thing)

ADD COMMENTlink written 3.3 years ago by plijnzaad30
0
gravatar for plijnzaad
3.3 years ago by
plijnzaad30
European Union
plijnzaad30 wrote:

Oh, and I just see that samtools view has an option -s SEED.percentage

   -s FLOAT integer part sets seed of random number generator [0];
               rest sets fraction of templates to subsample [no subsampling]

This is prolly the fastest way to do it.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by plijnzaad30

I don't think this randomizes the file, but rather takes a random subsample.

ADD REPLYlink written 3.3 years ago by Matt Shirley9.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1744 users visited in the last hour