Question: What is the best way to make a bam file smaller by removing unwanted information?
1

Dataman (Finland), 3.2 years ago, wrote:

Hi!

I am trying to shrink many BAM files (around 60 of them, each ~200 GB) by removing base qualities, tags, and other unwanted information, due to disk space limitations. For copy number analysis, base qualities, tags, and duplicates are all unwanted information; the only thing I care about is the mapping quality (MAPQ), since I filter out low-quality reads!

Currently, I am using the bamUtil squeeze command. I don't know yet how effective this tool is at shrinking BAM files. The squeeze subcommand can replace QNAME with an integer, remove duplicates, and remove the OQ tag (i.e. the original base qualities), but it does not delete the QUAL field. For the QUAL field, the tool instead provides a binning option (reducing the number of distinct quality scores, which helps compression).

Previously, I used cgat bam2bam method=strip-quality, which deletes only the QUAL field. This tool is slow (~12 hours for a 160 GB BAM file) and didn't free much space: the modified BAM file was only 3 GB smaller for a 160 GB input.

I was wondering whether deleting everything that comes after SEQ in a SAM/BAM record (i.e. QUAL and all optional tags) would work, and if so, what the fastest way to do it would be. Or is there a tool available that I was not able to find?
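The "blank everything after SEQ" idea can be sketched with plain awk over SAM text (per the SAM spec, a record has 11 mandatory tab-separated columns; QUAL is column 11 and optional tags follow). The sample record and the samtools wrapping mentioned in the comment are illustrative assumptions:

```shell
# Keep columns 1-10, blank QUAL (column 11), and drop all optional tags.
# In a real pipeline this would be wrapped as:
#   samtools view -h in.bam | awk '...' | samtools view -b - > out.bam
printf 'r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\n' \
| awk -v OFS='\t' '/^@/ {print; next}
       {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,"*"}'
# -> r1  0  chr1  100  60  5M  *  0  0  ACGTA  *
```

Header lines (starting with @) are passed through untouched so the output remains a valid SAM stream.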

Thanks in advance for sharing your ideas!

EDIT1: My question might have been misleading, since I said "I only care about MAPQ". I also care about the FLAG and SEQ: later in the pipeline I will call variants, and for that only FLAG, MAPQ, and SEQ are needed, nothing else.

EDIT2: Now I have the results from bamUtil squeeze, and to me they are satisfactory. The BAM file is ~4-fold smaller when one:

  1. removes the OQ tag,
  2. removes the duplicates, and
  3. bins the base quality scores.

And it took less than 4 hours (3:51) to squeeze a 160 GB file down to 43.5 GB.
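A sketch of the squeeze invocation described above, shown as a dry run that only echoes the command. The flag names are assumptions recalled from bamUtil's documentation and must be verified with `bam squeeze --help`; in particular, the duplicate- and OQ-removal options differ between bamUtil releases:

```shell
# Dry-run sketch of a bamUtil squeeze step (drop the `echo` to execute).
# --binQualS (assumed name) bins base qualities by listing each bin's
# starting quality; check `bam squeeze --help` before relying on any flag.
IN=in.bam
OUT=squeezed.bam
echo bam squeeze --in "$IN" --out "$OUT" --binQualS 2,10,20,25,30,35,40 --binMid
```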

modified 3.2 years ago • written 3.2 years ago by Dataman
2

While it's possible to delete data from a BAM file, this question makes me a little uneasy because it sounds like you're trying to fit a square peg into a round hole. Why not keep the raw BAM data on DVDs or something, and then just extract the information you need/want (in a different file format)?

For example, for CNV you might just want to know how many reads map to each spot on the genome. In that case, just make a BED/BigWig file? If you need the MAPQ, you could make a BED file of the running MAPQ average/max/min. Two BED files of that sort would most likely be less than 100 MB each :)
Just a suggestion: I don't know your downstream steps, so maybe none of this is relevant.
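The per-region counting idea can be sketched with plain awk over SAM text, emitting BED-like bins of read counts. The bin size and sample records are illustrative assumptions; a real pipeline would feed this from `samtools view` or use bedtools instead:

```shell
# Count reads per fixed-size bin (here 1000 bp) from SAM text, printing
# BED-like lines: chrom, bin start, bin end, read count.
# Real input would come from e.g.: samtools view -q 5 in.bam
printf 'r1\t0\tchr1\t150\t60\t5M\t*\t0\t0\t*\t*\nr2\t0\tchr1\t900\t60\t5M\t*\t0\t0\t*\t*\nr3\t0\tchr1\t1200\t60\t5M\t*\t0\t0\t*\t*\n' \
| awk -v OFS='\t' -v BIN=1000 '
    !/^@/ { key = $3 OFS int($4/BIN)*BIN; n[key]++ }
    END   { for (k in n) { split(k, a, OFS); print a[1], a[2], a[2]+BIN, n[k] } }' \
| sort -k1,1 -k2,2n
# -> chr1  0     1000  2
#    chr1  1000  2000  1
```

The same pattern extends to a running MAPQ average by accumulating $5 per bin and dividing by the count in the END block.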

written 3.2 years ago by John

That is a plausible approach, and I guess in the end I will do something like: download a batch of BAM files (as far as disk space permits), run the downstream analysis, delete the BAM files, and start a new batch!

written 3.2 years ago by Dataman
6

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087), 3.2 years ago, wrote:

I quickly wrote a tool: see https://github.com/lindenb/jvarkit/wiki/Biostar173114

I checked; it's compatible with samtools.
$ java -jar dist/biostar173114.jar --formatout sam  my.bam  | samtools view -h | head -n 100

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:1    LN:249250621
@SQ SN:2    LN:243199373
(...)
@SQ SN:GL000192.1   LN:547496
R0  0   1   822010  55  51M *   0   0   *   *
R1  16  1   822010  55  51M *   0   0   *   *
R2  0   1   829084  60  87M13S  *   0   0   *   *
R3  0   1   829084  60  87M13S  *   0   0   *   *
R4  0   1   829084  60  87M13S  *   0   0   *   *
R5  16  1   829122  59  48M36D52M   *   0   0   *   *
R6  16  1   829122  59  48M36D52M   *   0   0   *   *
R7  16  1   829122  59  48M36D52M   *   0   0   *   *
R8  0   1   848195  46  19S20M3S    *   0   0   *   *
R9  16  1   848195  46  19S20M3S    *   0   0   *   *
Ra  0   1   908830  60  47M *   0   0   *   *
Rb  16  1   908830  60  47M *   0   0   *   *
Rc  0   1   922930  60  100M    *   0   0   *   *
Rd  0   1   922930  60  100M    *   0   0   *   *
Re  0   1   922930  60  100M    *   0   0   *   *

And for better results, you should process your data in parallel.

modified 3.2 years ago • written 3.2 years ago by Pierre Lindenbaum

Fast as hell - nice one Pierre :D

written 3.2 years ago by John

Thanks for the quick code! :) Am I right that your code replaces the SEQ with a *? If so, I guess I was unclear in my question when I said I only cared about the MAPQ; I also care about the SEQ itself. In that case I need to edit the question.

EDIT: I tried to use your code but I encountered a compilation issue (Apache Ant is not available). I am looking into it!

modified 3.2 years ago • written 3.2 years ago by Dataman
1

I can add a flag to remove the seq. Give me a few minutes...

written 3.2 years ago by Pierre Lindenbaum

Done. The default is now to show the bases. I also removed the hard clips.

written 3.2 years ago by Pierre Lindenbaum
2

trausch (Germany), 3.2 years ago, wrote:

You may want to use CRAM (reference-based BAM compression):

http://www.ebi.ac.uk/ena/software/cram-toolkit

htslib already supports CRAM, so tools built on top of htslib can read CRAM natively.
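Since samtools is built on htslib, the BAM/CRAM round trip can also be done with samtools itself. Shown here as a dry run that echoes the commands, because the file names are placeholders; note that the reference FASTA must be the same one the BAM was aligned to:

```shell
# BAM <-> CRAM with samtools (drop the `echo`s to execute).
# -C writes CRAM, -b writes BAM, -T supplies the reference FASTA.
REF=ref.fa
echo samtools view -C -T "$REF" -o out.cram in.bam    # BAM -> CRAM (archive)
echo samtools view -b -T "$REF" -o out.bam out.cram   # CRAM -> BAM (on the fly for the pipeline)
```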

written 3.2 years ago by trausch
1

I am not sure whether this is an immediate and fast solution to my issue, since my copy number analysis pipeline currently works with BAM files only!

written 3.2 years ago by Dataman
1
If CRAM saves enough space, maybe you could convert to BAM on the fly for your CNA pipeline?
To convert a CRAM file to BAM:
> java -jar cramtools-3.0.jar bam --input-cram-file <input cram file> --reference-fasta-file <reference fasta file> --output-bam-file <output bam file>
written 3.2 years ago by Robert Sicko

This is also a good idea! 

written 3.2 years ago by Dataman
2

dariober (WCIP | Glasgow | UK), 3.2 years ago, wrote:

If you really want to throw stuff away from a BAM file, you could use something along these lines:

samtools view -F 3844 -q 5 -h in.bam \
| awk -v OFS="\t" '{if($0 ~ /^@/){print $0} else {print NR, $2, $3, $4, $5, $6, $7, $8, $9, "*", "*"}}' \
| samtools view -Sb - > squeezed.bam

Essentially, awk replaces the read name with a numeric ID (NB: the two mates of a pair will get different names!) and keeps only the fields you need from each read. The first samtools command also throws away reads with MAPQ < 5, duplicates, and other possibly unwanted reads (via -F 3844).

In terms of speed it might not be the best solution, but I guess it's acceptable...

Needless to say, run some tests first!

EDIT: You could even remove the read name altogether and put * instead. But really, I agree with John's comment: are you sure you don't have better options?

modified 3.2 years ago • written 3.2 years ago by dariober

Thanks! Your answer implements what I was originally looking for! I changed it slightly to replace the read name with '*', keep the SEQ, and remove reads with MAPQ below 10. Now I am checking how much space this approach saves and how fast it is. This is the modified command I am using:

samtools view -F 3844 -q 10 -h in.bam | awk -v OFS="\t" '{if($0 ~ /^@/){print $0} else {print "*", $2, $3, $4, $5, $6, $7, $8, $9, $10, "*"}}' | samtools view -Sb - > squeezed.bam

EDIT: I now have the results and they are very promising! The file is ~8-fold smaller (160 GB -> 18.6 GB)!

modified 3.2 years ago • written 3.2 years ago by Dataman
2

Manu Prestat (Marseille, France), 3.2 years ago, wrote:

Can't you subset your BAM to only the targeted regions, if that is relevant for your study? Example:

samtools view -b alignment.bam refname:start-stop > sampled_alignment.bam
modified 3.2 years ago • written 3.2 years ago by Manu Prestat

Unfortunately, I need to do whole-genome CNA analysis. As a result, subsetting would not solve the disk space issue, because I would still need to keep all the subsets on disk.

written 3.2 years ago by Dataman
Powered by Biostar version 2.3.0