Question: Downsampling Bam Files
6
gravatar for Allpowerde
6.9 years ago by
Allpowerde1.2k
Allpowerde1.2k wrote:

HI I have an insane coverage in my bamfiles (thanks to only 12way multiplexing of a tiny genomic region). Since some programs can't handle insane coverages, I would like to downsample, that is pick say 1000 aligned reads for each position randomly (E.g. with bamtools random: Select random alignments from existing BAM file(s)) but I need to keep the mates intact. Another way of putting this: I want to achieve a maximal read-depth of 1000 for each position.

Has anyone experience with downsampling? Are there any pitfalls I might not be aware of?

read bam random • 13k views
ADD COMMENTlink written 6.9 years ago by Allpowerde1.2k
1

Well it is a targeted resequencing experiment with multiple overlapping PCR fragments so I have spikes of coverage going into the 100,000+ and then I have areas which hardly have any reads at all. But conceptually I'm after the functionality GATK offers with -dcov but I want to have the result written as a bam file.

ADD REPLYlink written 6.9 years ago by Allpowerde1.2k
1

3 answers none chosen as correct or even upvoted! I'm only answering for the karma ;)

ADD REPLYlink written 6.9 years ago by Casbon3.1k

When you say "insane", can you quantify the range please?

ADD REPLYlink written 6.9 years ago by jvijai1.1k
12
gravatar for lh3
6.4 years ago by
lh330k
United States
lh330k wrote:

Downsampling support has been added to "samtools view" in SVN. It always keeps both ends of a pair using a very simple strategy (<10 lines of additional source code). Credit to Hideo Imamura at Sanger.

ADD COMMENTlink written 6.4 years ago by lh330k

has anybody tried this? I am not able to find any documentation of this functionality in samtools view. Thanks!

ADD REPLYlink written 6.1 years ago by Wen.Huang1.1k

For archiving's sake, the option is named 'subsampling' and can be called with the -s switch (code), see Subsampling Bam File With Samtools.

ADD REPLYlink written 3.5 years ago by Leonor Palmeira3.6k

picard tools has it, trying it...

ADD REPLYlink written 6.0 years ago by Wen.Huang1.1k
2
gravatar for Bio_X2Y
6.9 years ago by
Bio_X2Y3.4k
Ireland
Bio_X2Y3.4k wrote:

Perhaps a lot of your reads are PCR artifacts - maybe you can consider removing them using samtools (the rmdup feature). Since your reads are paired-end, the script will consider pairs A-B and C-D to be duplicates if A and C map to the same co-ordinate and also B and D map to the same co-ordinate. As an added bonus, the script retains the pairs with the highest quality.

Even if you choose not to use this for your main downsizing, it might be a good first step to reduce the size of your BAM in a rational way.

ADD COMMENTlink written 6.9 years ago by Bio_X2Y3.4k
1
gravatar for Casbon
6.9 years ago by
Casbon3.1k
Casbon3.1k wrote:

I have done this for dealing with similar coverages. The pitfall is if you are downsampling you don't want to lose signal from either allele (this is diploid?) or any sample. 1000/12 = 83 reads on average. Naive binomial probability of missing an allele is about 1e-25, so nothing to worry about there. However, I would split the file by multiplex, downsample and rejoin to ensure equal reads of each sample

A problem you might have with this approach is that you might have high levels of noise/allelic imbalance due to polymerase errors. I would therefore prefer to use all the data wherever I can. I have been investigating chemware solutions to these problems, you might want to get in touch to dicsuss. Too late for these data.

ADD COMMENTlink written 6.9 years ago by Casbon3.1k

Yes, using all the data is a sensible thing to do although that does not protect from polymerase biases because programs usually go for the majority feature they see anyways, that's maybe why downsampling is default in GATK (to 250 reads). Hence doing the downsampling yourself might be a better way of handling things (especially since you can downsample on each sample and then combine -- as you suggested). But my question is still what program to use for the downsampling :)

Could you go into more details on what chemware's solution offers?

ADD REPLYlink written 6.9 years ago by Allpowerde1.2k
1
gravatar for Allpowerde
6.9 years ago by
Allpowerde1.2k
Allpowerde1.2k wrote:

It seems that GATK has also a script for downsampling: randomSampleFromStream.pl

ADD COMMENTlink written 6.9 years ago by Allpowerde1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 606 users visited in the last hour