I am trying to shrink a large number of BAM files (around 60, each around 200GB) because of disk space limitations, by removing base qualities, tags, and other unwanted information. I am doing copy number analysis, so base qualities, tags, and duplicates are of no use to me. The only thing I care about is the mapping quality (MAPQ), since I filter out low-quality reads.
Currently, I am using the bamUtils squeeze command, but I don't know yet how well it shrinks a BAM file. The squeeze subcommand can replace QNAME with an integer, remove duplicates, and remove the OQ tag (i.e. the original base qualities), but it cannot remove the QUAL field. For the QUAL field, the tool instead provides a binning option (to reduce the number of distinct quality scores).
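To illustrate what binning does to the QUAL field, here is a minimal Python sketch. The bin boundaries and representative values are illustrative assumptions (an 8-level scheme), not necessarily the ones bamUtils squeeze uses, and `bin_phred` and `bin_qual_string` are hypothetical helper names. The point is only that collapsing many distinct scores into a few makes the QUAL column much more compressible.

```python
# Illustrative bins: (lowest Phred, highest Phred, representative value).
# These boundaries are an assumption for the example, not bamUtils' defaults.
BINS = [
    (0, 1, 0),
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]
TOP_BIN = 40  # every score of 40 or more maps to 40


def bin_phred(q):
    """Map one Phred score to the representative value of its bin."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    return TOP_BIN


def bin_qual_string(qual, offset=33):
    """Bin every score in a Phred+33 encoded QUAL string."""
    if qual == "*":  # '*' means qualities are not stored
        return qual
    return "".join(chr(bin_phred(ord(c) - offset) + offset) for c in qual)


print(bin_qual_string("IIHGF5#"))
```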
Previously, I used cgat bam2bam with method=strip-quality, which deletes only the QUAL field. This tool is slow (~12 hours for a 160GB BAM file) and didn't free much space: the modified BAM file was only 3GB smaller than the 160GB original.
I was wondering whether deleting everything that comes after SEQ in a SAM/BAM record (i.e. QUAL and all the optional tags) would work. If yes, what would be the fastest way to do that? Or is there an existing tool that I was not able to find?
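To make the idea concrete, here is a minimal Python sketch of what "deleting everything after SEQ" could look like on SAM text. It assumes tab-separated SAM records with the 11 mandatory columns, and it writes `*` into QUAL (the SAM placeholder for qualities that are not stored) rather than removing the column, so the record stays valid. `strip_record` and `strip_stream` are hypothetical names, and the example read is made up.

```python
def strip_record(line):
    """Return a SAM line with QUAL set to '*' and all optional tags dropped."""
    if line.startswith("@"):
        # Header lines are passed through unchanged.
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated SAM columns")
    # Columns 1-10 are QNAME..SEQ, column 11 is QUAL, the rest are tags.
    return "\t".join(fields[:10] + ["*"]) + "\n"


def strip_stream(lines):
    """Lazily apply strip_record to an iterable of SAM lines."""
    for line in lines:
        yield strip_record(line)


# Made-up record, only for demonstration.
example = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0\tOQ:Z:IIII\n"
print(strip_record(example), end="")
```

On real data one could feed the output of `samtools view -h in.bam` through `strip_stream(sys.stdin)` and pipe the result back into `samtools view -b`. Note that this also drops tags such as RG, which some downstream tools may expect, and a pure-Python pass over files of this size is unlikely to be the fastest option.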
Thanks in advance for sharing your ideas!
EDIT1: My question might have been misleading, since I said "I only care about MAPQ". I also care about FLAG and SEQ, since I will call variants later in the pipeline; but even there only FLAG, MAPQ and SEQ are needed, and nothing else.
EDIT2: I now have the result from bamUtils squeeze, and to me it is satisfactory. The BAM file is ~4-fold smaller when one:
- removes the OQ tag,
- removes the duplicates, and
- bins the base quality scores
And it took less than 4 hours (3:51) to squeeze a 160GB file to 43.5GB.