BAMboozle
1
0
Entering edit mode
11 months ago
mropri ▴ 150

Hi,

I am running BAMboozle to anonymize variant sequences using the GRCh37 human reference genome on my bam files. My bam files originally are 2-3 GB but when I get the output bam file from BAMboozle it is 500-600 Kb. Does BAMboozle decrease the size of the bam file. I would say no from what I have looked into it, so is the task not getting completed because the reference genome used is the incorrect one or is there another reason. Appreciate any help.

BAMboozle • 929 views
ADD COMMENT
0
Entering edit mode
11 months ago

The tool will only keep primary alignments and will discard unaligned reads. Then it simplifies alignments to be exact matches so there will be a whole lot less data to store, in addition, simpler data compresses much better.

So the output ought to be substantially smaller, and the size reduction would depend on the specifics of the alignments. What percent were primary alignments, how many variants, and how many unaligned reads were there.

Run samtools flagstat before and after bamboozling.

ADD COMMENT
0
Entering edit mode

First, thanks for your reply. I ran samtools flagstat before BAMboozle and got the following metrics:

41946670 + 0 in total (QC-passed reads + QC-failed reads)
6556652 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
41935169 + 0 mapped (99.97% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

After BAMboozle, the stats are the following

247042 + 0 in total (QC-passed reads + QC-failed reads)
220252 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
235541 + 0 mapped (95.34% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

This seems odd that the reads that passed QC go from 41 million to 247,042. This is also keeping unmapped reads so I imagine mapped reads are even less. Is this a normal reduction in output from BAMboozle?

ADD REPLY
0
Entering edit mode

BAMboozle is somehow bringing the reads down to a smaller number. You also have a large amount of secondary alignments for these reads. Do you have a link for this package?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

there is no QC there, that's just a wording,

I don't think the output is correct. frankly, you might be running it incorrectly, or something else is amiss

you can easily check the alignments and see whether it keeps the alignments it should

for example, in my reading, your total number of alignments after bamboozling should be

samtools view -c -q 1 -F 4 -F 256 -F 2048  original.bam

the above command counts alignments after removing multi-mapped (-q 1) unmapped (-F 4), secondary (-F 256), and, supplementary (-F 2048) alignments

then on the alignments that it keeps the tool is supposed to replace variants with exact matches the CIGAR should be all 150M

ADD REPLY
0
Entering edit mode

Running you above command on my bam file before BAMboozle I get:

35119839

After BAMboozle, the command outputs:

3003

Its confusing because BAMboozle (link provided above in response to GenoMax) just requires the bam file and the reference genome fasta. I have tried GENCODE, NCBI and two other version of GRCh37 HUMAN fasta files and still the output bam file is KiB rather than GiB

ADD REPLY
0
Entering edit mode

I am curious about these two options in BAMboozle. Does this mean it will remove these two types of reads if you don't include those options. Can you try by adding one or both below?

--keepsecondary  Keep secondary alignments in output bam file.
--keepunmapped   Keep ummapped reads in output bam file.

How many original reads did you have in the fastq files?

ADD REPLY
0
Entering edit mode

yes, if we don't specify this option these reads are not kept. Also I figured out the problem. The fasta file has to be the exact match to the gtf reference genome file that I used to align the samples. So I used GRCh38.101 from ENSEMBL to align my files so I needed to use the fasta file of the version of the genome. I am so sorry for the mistake and am really appreciative of all your guys' help

ADD REPLY

Login before adding your answer.

Traffic: 2468 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6