What is the best way to make a bam file smaller by removing unwanted information?
4
1
Entering edit mode
5.7 years ago
Dataman ▴ 340

Hi! 

I am trying to make smaller many BAM files (around 60 of them) of size ~200GB (due to disk space limitations) by removing base qualities and tags and other unwanted information. Doing copy number analysis, for me base qualities and tags and duplicates are somehow unwanted information. What I only care about is the mapping quality (MAPQ) since I filter low quality reads!

Currently, I am using bamUtils squeeze  command. I don't know yet how good this tool is in making the bam file smaller! The squeeze sub command can replace QNAME with an integer, remove duplicates, and remove OQ tag (i.e. original base qualities) but not the QUAL field. However, for the QUAL field, the tool provides the binning option (to reduce the number of possible quality scores). 

Previously, I used cgat bam2bam method=strip-quality which deletes only the QUAL field. This tool is slow (takes ~12 hours for a 160 GB BAM file) and didn't free much space. The modified BAM file was only 3GB smaller for a 160GB file.

I was wondering if deleting whatever comes after the SEQ (in a SAM/BAM file) will work (i.e QUAL and all other tags)? and if yes, what would be the fastest way to apply that? Or, if there is a tool available that I was not able to find?

Thanks in advance for sharing your ideas!

EDIT1: My question might have been misleading since I said "I only care about MAPQ".  I also care about the FLAG and SEQ. Since later in the pipeline, I will call variants; but there only FLAG, MAPQ and SEQ are needed and not any thing else.  

EDIT2: Now, I have the result from using bamUtils squeeze and to me the result is satisfactory. The BAM file is ~4-fold smaller when one:

  1. removes the OQ tag,
  2. removes the duplicates, and
  3. bin the base quality scores

And it took less than 4 hours (3:51) to squeeze a 160GB file to 43.5GB.

next-gen bam sequencing compression • 4.0k views
ADD COMMENT
2
Entering edit mode

While its possible to delete data in a BAM file, this question makes me a little uneasy because it sounds like you're trying to fit a round peg into a square hole. Why not keep the raw BAM data on DVDs or something, and then just extract the information you need/want (in a different file format).

For example, for CNV you might just want to know how many reads map to each spot on the genome. In that case, just make a BED/BigWig file? If you need the MAPQ, you could make a BED file of the running MAPQ average/max/min. Two BED files of that sort would most likely be less than 100Mb each :)

Just a suggestion - I dont know your downstream steps so maybe none of this is relevant.

ADD REPLY
0
Entering edit mode

That is a plausible thought and I guess in the end I will end up doing something like: download a batch of BAM files (as far as the disk space permits), do the downstream analysis, delete the BAM files, start a new batch!

ADD REPLY
6
Entering edit mode
5.7 years ago

I quickly wrote a tool: see https://github.com/lindenb/jvarkit/wiki/Biostar173114

I checked, it's compatible with samtools.

$ java -jar dist/biostar173114.jar --formatout sam  my.bam  | samtools view -h | head -n 100

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:1    LN:249250621
@SQ SN:2    LN:243199373
(...)
@SQ SN:GL000192.1   LN:547496
R0  0   1   822010  55  51M *   0   0   *   *
R1  16  1   822010  55  51M *   0   0   *   *
R2  0   1   829084  60  87M13S  *   0   0   *   *
R3  0   1   829084  60  87M13S  *   0   0   *   *
R4  0   1   829084  60  87M13S  *   0   0   *   *
R5  16  1   829122  59  48M36D52M   *   0   0   *   *
R6  16  1   829122  59  48M36D52M   *   0   0   *   *
R7  16  1   829122  59  48M36D52M   *   0   0   *   *
R8  0   1   848195  46  19S20M3S    *   0   0   *   *
R9  16  1   848195  46  19S20M3S    *   0   0   *   *
Ra  0   1   908830  60  47M *   0   0   *   *
Rb  16  1   908830  60  47M *   0   0   *   *
Rc  0   1   922930  60  100M    *   0   0   *   *
Rd  0   1   922930  60  100M    *   0   0   *   *
Re  0   1   922930  60  100M    *   0   0   *   *

and for better results, you should process your data in parallel.

ADD COMMENT
0
Entering edit mode

Fast a hell - nice one Pierre :D

ADD REPLY
0
Entering edit mode

Thanks for the quick code! :) Is that right that your code replaces the SEQ with a *? If so, I guess I was unclear in my question when I said I only cared about the MAPQ. I also care about the SEQ itself. In that case I need to edit the question.

EDIT: I tried to use your code but I encountered a compilation issue (Apache ANT is not available ). I am looking into that!

ADD REPLY
1
Entering edit mode

I can add a flag to remove the seq. give me a few minutes...

ADD REPLY
0
Entering edit mode

done: . default is showing the bases. I also removed the hard clips.

ADD REPLY
3
Entering edit mode
5.7 years ago
trausch ★ 1.6k

You may want to use CRAM (reference-based BAM compression)

http://www.ebi.ac.uk/ena/software/cram-toolkit

htslib already supports CRAM so tools build on top of htslib can read CRAM natively.

ADD COMMENT
1
Entering edit mode

I am not sure whether this is an immediate and fast solution for my issue since my pipeline of copy number analysis works currently with BAM files only!

ADD REPLY
1
Entering edit mode

If CRAM saves enough space, maybe you could convert to BAM on the fly for your CNA pipeline?

To convert a CRAM file to BAM:

java -jar cramtools-3.0.jar bam --input-cram-file <input cram file> --reference-fasta-file <reference fasta file> --output-bam-file <output bam file>
ADD REPLY
0
Entering edit mode

This is also a good idea!

ADD REPLY
2
Entering edit mode
5.7 years ago

If you really want to throw stuff away from a bam file you could use something along these lines

samtools view -F 3844 -q 5 -h in.bam \
| awk -v OFS="\t" '{if($0 ~ /^@/){print $0} else {print NR, $2, $3, $4, $5, $6, $7, $8, $9, "*", "*"}}' \
| samtools view -Sb - > squeezed.bam

Essentially you use awk to replace the read name with a numeric ID (NB: paired reads will have different names!) and keep only stuff you need from each read. The first samtools command also throws away reads with mapq <5 and duplicates and other possibly unwanted reads.

In terms of speed it might not be the best solution, but I guess acceptable...

Needless to say, make some tests first!

EDIT: You could even remove the read name altogether and put * instead. But really, I agree with John's comment: Are you sure you don't have better solutions?

ADD COMMENT
0
Entering edit mode

Thanks! Your answer implements what I was originally looking for! I just did change your answer a little bit to replace read name with '*', include the SEQ and remove sequences with MAPQ below 10! Now, I am checking how much gain (in terms of space) I can get from this approach and how fast it is! This is the modified code I am using:

samtools view -F 3844 -q 10 -h in.bam | awk -v OFS="\t" '{if($0 ~ /^@/){print $0} else {print "*", $2, $3, $4, $5, $6, $7, $8, $9, $10, "*"}}' | samtools view -Sb - > squeezed.bam

EDIT: I now have the results and it is very promising! The file is now ~8-fold smaller (160GB -> 18.6GB)!

ADD REPLY
2
Entering edit mode
5.7 years ago

Can't you subset your BAM with only targeted region if relevant for your study? Example:

samtools view -b alignment.bam refname:start-stop > sampled_alignment.bam
ADD COMMENT
0
Entering edit mode

Unfortunately, I need to do whole genome CNA analysis. As a result, sub-setting would not solve the issue of disk space because after all I need to keep all the subsets on the disk.

ADD REPLY

Login before adding your answer.

Traffic: 1302 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6