How to extract reads from BAM file and shuffle in another BAM to simulate a new contaminated BAM file
1
0
Entering edit mode
7.7 years ago
MAPK ★ 2.1k

I am trying to create a BAM file merging two different BAM files simulating at different level of contamination. Suppose I have two bam files for SampleA and SampleB. I want to generate 5 different BAM files at contamination level of 10%, 20%, 30%,40% and 50%. I understand that I should extract reads from one of the two bam files at the given dilution percentage and reassign to the other bam file, but I don't know exactly how to do this. Can someone please explain me the procedure? Thanks

NGS BAM • 3.7k views
ADD COMMENT
0
Entering edit mode

Sub-sample BAM to get # of alignments you want: Subsample BAM to fixed number of alignments Then merge with your other BAM (and shuffle if needed randomly shuffle BAM file ).

ADD REPLY
5
Entering edit mode
7.7 years ago
Gabriel R. ★ 2.9k

I did this for a paper so I can help you out :-) Suppose we want 30% Sample_A and 70% Sample_B. Let us assume that both BAM files have the same number of reads. What you can do is first subsample your sample_A:

samtools  view -b -s 0.3 sampleA.bam > /tmp/tA.bam

Then subsample the other BAM file:

samtools  view -b -s 0.7 sampleB.bam > /tmp/tB.bam

Then cat both of them:

samtools cat -o output30A70B.bam /tmp/tA.bam /tmp/tA.bam 
rm -f /tmp/tA.bam
rm -f /tmp/tB.bam

However, I prefer to use file descriptors:

samtools cat -o output30A70B.bam <( samtools  view -b -s 0.3 sampleA.bam) <( samtools  view -b -s 0.7 sampleB.bam)

I haven't tested the commands but let me know if you run into problems. If the initial BAM files do not have the same number of reads, then you need to compute how much to subsample from each file.

ADD COMMENT
0
Entering edit mode

I think samtools merge might be preferable to just catting .bams, since there are EOF characters, and maybe other things.

ADD REPLY
0
Entering edit mode

I assumed that the initial BAM files are not sorted. samtools cat does not merely call the cat from Linux, it is coded to handle EOF characters given if all BAM files are fine and have the EOF character at the end.

ADD REPLY
0
Entering edit mode

@Gabriel R. Thanks. What if the number of reads are different in two different BAM files? How do we create dilution under such circumstances?

ADD REPLY
2
Entering edit mode

Well ask yourself how many reads you want in the final BAM. If say I want 100k reads with 40%A and 60%B, then you want 40k from Sample_A and 60k reads from Sample_B. If sample_A contains 1M reads, then you want to subsample with prob=40/1000 from it. If sample_B contains 1.5M reads, then you want to subsample with prob=60/1500 from it. Be sure to recompute the final contamination rate as it may be slightly different from the simulated one. Also, if you plan on doing this multiple times, I think there is a seed in the random # generator so you might want to look into that. I also built my own subsampler for a fixed # of reads:

https://github.com/grenaud/libbam/blob/master/subsamplebamFixedNumber.cpp

Hope this helps!

  • Gabriel
ADD REPLY
1
Entering edit mode

Thanks. I have found Picard tools very useful for this purpose including down sampling the bam files using CalculateHsMetrics tool.

ADD REPLY

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6