Question: Does BWA mem ever discard or modify data
0
e.benn wrote, 10 days ago:

I could not find a definitive answer to this question. Does BWA mem modify the data it receives? I understand that it adds data (alignment of course, and more), but am wondering whether the data in the fastq is guaranteed to survive.

The program takes in read names, a sequence string, and a quality string. Does it ever modify any of those?

The reason I ask is that I am looking to compress our fastq data by running BWA, discarding as many unneeded SAM columns and tags as possible, and compressing to CRAM. I have run various tests, and it seems to be working fine, but wanted to know for sure.


bwa • 184 views
modified 6 days ago • written 10 days ago by e.benn
2

It is not quite the right question - BWA produces SAM output - so the question should be "Is the sequence information in the SAM file identical to the FASTQ sequence?", to which the answer is no. The SAM format represents information on the forward strand, hence sequences mapping to the reverse strand will be reverse-complemented in the SAM file.

Perhaps the question is whether the FASTQ information can always be reconstructed from the SAM file - to which the answer is, again, not always. If an alignment is reported hard-clipped, the clipped sequence will not be in the SAM file.
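To make the reverse-complement point concrete, here is a minimal Python sketch that recovers a FASTQ record from a primary, non-hard-clipped SAM line. The field positions and the 0x10 (reverse strand) flag bit are per the SAM specification; the function name is my own.

```python
# Recover the original FASTQ record from a SAM line. For reads mapped to
# the reverse strand (flag bit 0x10), SEQ and QUAL are stored flipped, so
# we reverse-complement SEQ and reverse QUAL to get back the input.

COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_fastq(sam_line):
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:  # mapped to the reverse strand: undo the flip
        seq = seq.translate(COMP)[::-1]
        qual = qual[::-1]
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)

# A reverse-strand record (flag 16): SEQ comes back reverse-complemented.
rec = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIHG"
print(sam_to_fastq(rec))  # @r1 / CGTT / + / GHII
```

Note this only works when the record is not hard-clipped; with an H in the CIGAR the dropped bases are simply gone.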

written 10 days ago by Istvan Albert
1

running BWA, discarding all but the name, sequence and quality

So why don't you archive the fastq in the first place? I don't see the reason to align and then discard the work the aligner has done. If you really want to store unaligned CRAM instead of fastq, you can convert fastq to unaligned SAM with, e.g., Picard's FastqToSam.
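For illustration, this is roughly what an unaligned SAM record looks like - a hypothetical Python sketch of the kind of output FastqToSam produces, not Picard's actual code:

```python
# Sketch: one unmapped SAM record per read. Flag 4 = "read unmapped";
# the positional/alignment fields are placeholders ('*' / 0).

def fastq_to_unaligned_sam(fastq_text, sample="sample1"):
    lines = fastq_text.strip().split("\n")
    out = ["@HD\tVN:1.6\tSO:unsorted",
           "@RG\tID:A\tSM:%s" % sample]
    for i in range(0, len(lines), 4):
        name = lines[i][1:].split()[0]   # drop '@' and any description
        seq, qual = lines[i + 1], lines[i + 3]
        out.append("\t".join([name, "4", "*", "0", "0", "*", "*", "0", "0",
                              seq, qual, "RG:Z:A"]))
    return "\n".join(out) + "\n"

print(fastq_to_unaligned_sam("@read1\nACGT\n+\nIIII\n"))
```

The point is that every byte of the FASTQ (name, sequence, quality) survives verbatim in the unaligned record.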

written 10 days ago by dariober

discarding all but the name, sequence and quality, and compressing with CRAM

In fact, he can't do what he wants, because CRAM is an alignment format, and the compression is based on the alignment to the reference genome. If he discards everything but the name, sequence and quality, he can't save as CRAM.

written 10 days ago by h.mon

Yes, very good. I also need to keep the alignment, the "which end of the pair" metadata, and probably some more I have forgotten.

written 6 days ago by e.benn
3
h.mon (Brazil) wrote, 10 days ago:

I believe you can keep all information from a fastq file if you use bwa mem with default parameters. The resulting SAM will contain the unmapped reads, and for each mapped read only one record will be the primary alignment - additional mappings will be flagged as "secondary". By default, bwa soft clips primary alignments, so all sequence and quality information is kept. And even if the original fastq read names were edited - removing /1 and /2, or whatever other convention was used - the records carry the "first in pair" and "second in pair" flags, so read names can be recovered in full.
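The flag bits mentioned above can be decoded directly; a small sketch using the bit values from the SAM specification (helper names are mine):

```python
# SAM flag bits relevant here: even if /1 and /2 suffixes are stripped from
# read names, bits 0x40/0x80 record which mate a record is, and 0x100 marks
# secondary alignments (the primary copy carries the full SEQ/QUAL).

FLAG_BITS = {
    0x1:   "paired",
    0x4:   "unmapped",
    0x10:  "reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "secondary alignment",
}

def describe_flag(flag):
    # Return the names of all set bits, in ascending bit order.
    return [name for bit, name in sorted(FLAG_BITS.items()) if flag & bit]

print(describe_flag(99))    # a typical properly-paired first mate
print(describe_flag(339))   # same, but reverse strand and secondary
```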

A great tool to check my claim is BamHash.

edit: this is not to say storing reads as CRAM files is a good idea, though some people think it is indeed good for long-term storage. See CRAM vs BAM for references, and these benchmarks:

https://www.uppmax.uu.se/support/user-guides/using-cram-to-compress-bam-files/

http://www.htslib.org/benchmarks/CRAM.html

modified 6 days ago • written 10 days ago by h.mon
5
andrew.j.skelton73 (London) wrote, 10 days ago:

What you might be looking for is uBAM. This is used in GATK's production pipeline to attach metadata to samples as early as possible, though the scale of data the Broad analyses is far larger than at most places. The biggest caveat to this approach is outlined by Brian Bushnell (author of BBMap) on that thread: gzipped fastq is much more practical in terms of (de)compression efficiency and resulting storage footprint, not to mention that by forcing your reads to the SAM spec you may lose the original read names, which can cause problems with tools that assume the original Illumina read names (i.e., paired-end data could be parsed as single-end).

I was thinking along the same lines as you a few years ago, and opted against uBAM and stuck with gzipped fastq. That said, I recall a thread about lossless fastq compression here. The tool (ALAPY) that @Petr Ponomarenko outlined is available here, and there's a fantastic discussion on that thread about benchmarking and feature requests - worth a read.

Additional methodologies out there include binary fastq, as @chris.bird mentioned:

  • @John's uQ tool
  • @Brian Bushnell's Clumpify tool (from bbmap)

Overall, my advice is to avoid uBAM and check out lossless compression methods like ALAPY, uQ, or Clumpify, and see what best fits your needs, as there will be tradeoffs in terms of memory usage and time to execute per file.

modified 10 days ago • written 10 days ago by andrew.j.skelton73

Thanks Andrew for that detailed answer, I will have a close look at all this.

written 10 days ago by e.benn

No matter what compressed storage format you use, the majority of the entropy in sequence data is actually in the base quality scores, not the actual sequences. If you want a 50%+ improvement over fq.gz then you'll be forced to discard information - base quality score binning will make a bigger difference to your storage footprint than your choice between fq.gz, BAM and CRAM. Whether this level of loss is acceptable depends on your requirements.
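To illustrate the binning point, here is a sketch of Illumina-style 8-level quality binning. The bin boundaries follow Illumina's published 8-bin scheme as I recall it - treat them as an assumption and check against your own requirements:

```python
# Collapsing Phred scores into a few representative values shrinks the
# quality-string alphabet, which is where most of the compressible entropy
# lives. Each tuple is (low, high, representative) in Phred units.

BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

def bin_qual(qual_string, offset=33):
    out = []
    for ch in qual_string:
        q = ord(ch) - offset
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                out.append(chr(rep + offset))
                break
    return "".join(out)

orig = "JJIIHHGG"            # Phred 41, 41, 40, 40, 39, 39, 38, 38
print(bin_qual(orig))        # IIIIFFFF - 4 distinct symbols collapse to 2
```

After binning, a generic compressor like gzip has far fewer symbols to encode, which is where the storage win comes from.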

written 5 days ago by d-cameron
0
chris.bird wrote, 10 days ago:

It can clip bases; you'll want to relax the defaults.

Reads could also be filtered if they don't map to the reference genome.

Given that a bam file has more info than a fastq file, this probably won't work.

Search on binary fastq for a better solution.

modified 10 days ago • written 10 days ago by chris.bird

it can clip bases, you'll want to relax the defaults

The bases are soft clipped, so the information is still there. Only supplementary alignments are hard clipped, and that information is in the primary alignment anyway.

reads could also be filtered if they don't map to the reference genome

Unmapped reads are retained in the bwa output. They just have the "read unmapped" SAM flag set.

given that a bam file has more info than a fastq file, this probably won't work

bwa alignment does not remove any data from the input fastq. Even the record ordering is retained in the bwa output.
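The soft-clip vs hard-clip distinction is visible in the CIGAR string: per the SAM spec, the operations M/I/S/=/X consume read bases, so their lengths must sum to the length of SEQ. A quick sketch (function name is mine):

```python
import re

def seq_length_implied_by_cigar(cigar):
    # Sum the lengths of CIGAR operations that consume query (read) bases.
    # Soft clips (S) count - the clipped bases stay in SEQ; hard clips (H)
    # do not - those bases are dropped from the record entirely.
    consumes_seq = set("MIS=X")
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in consumes_seq)

print(seq_length_implied_by_cigar("5S90M5S"))  # soft-clipped: all 100 bases present
print(seq_length_implied_by_cigar("5H90M5H"))  # hard-clipped: only 90 bases in SEQ
```

So for a 100 bp read, a soft-clipped primary record still carries every base, while a hard-clipped (supplementary) record does not.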

written 5 days ago by d-cameron
Powered by Biostar version 2.3.0