Question: Converting large (compressed and unsorted) BAM files to fastq
2
gravatar for alesssia
2.5 years ago by
alesssia520
London, UK
alesssia520 wrote:

Dear all,

we got a few hundred large compressed BAM files (70GB <= size <= 300GB). They are not sorted and we would like to convert them back to fastq, in order to align them with a different algorithm.

We have paired-end reads and were planning to first sort the BAM by read name using sambamba (http://lomereiter.github.io/sambamba/docs/sambamba-sort.html) and then to convert the sorted BAM into fastq using bedtools (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html). However, while the sorting is relatively fast (about 4h for each file), the conversion is very slow.

Is anyone aware of any other procedure that will make the conversion faster? I've seen that there are alternatives to bedtools such as picard (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) and biobambam2 (https://github.com/gt1/biobambam2). Does anyone know the performances of these tools, if a benchmarking has already been performed and/or if there are better tools?

Thank you very much in advance :)

bam alignment fatsq picard bedtools • 2.5k views
ADD COMMENTlink modified 2.5 years ago by Brian Bushnell16k • written 2.5 years ago by alesssia520

Have you tried bam2fastq: https://gsl.hudsonalpha.org/information/software/bam2fastq

It says its no longer supported but still works

ADD REPLYlink written 2.5 years ago by Tonor420

Hi Tonor. Do you know if bam2fastq is faster than Picard's SamToFastq and why it has been discontinued for Picard's SamToFastq (as in the top pf their web page)?

ADD REPLYlink written 2.5 years ago by alesssia520

I've never done benchmarking so not sure which one if faster, I think as Picard tools accomplishes the same task they stopped actively developing it.

I don't think the bam file has to be sorted though for bam2fastq to work

ADD REPLYlink written 2.5 years ago by Tonor420

BAM needs to be name sorted for PE data for bedtools bamtofastq.

ADD REPLYlink written 2.5 years ago by genomax67k

Yes, it does need to be sorted if the reads are paired-end (as specified by both picard and bedtools).

ADD REPLYlink written 2.5 years ago by alesssia520

I am interested in the fastest way to complete the task, but I will add this tool to the list of programs to benchmark --if no one has done this before.

ADD REPLYlink written 2.5 years ago by alesssia520
1

I recently used bam2fastq for single end reads. The problem I found was that reads that had more than one alignment in the bam file ended up being present more than once in the fastq file.

ADD REPLYlink written 2.5 years ago by mastal5112.0k
1

Yes, that's why for the reformat command I added the "primaryonly" flag; without that it has similar behavior.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k
3
gravatar for Brian Bushnell
2.5 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

The BBMap package's reformat tool will do bam -> fastq conversion. I just tested it, and it ran at 50 Mbps/s, limited by Samtools conversion (it runs samtools in a subprocess for bam -> sam conversion). Sam.gz -> fastq is slightly faster at 70 Mbp/s. Converting a 2.5 GB sorted, mapped bam file took 88 seconds and produced a 10.5 GB fastq, so that's 120 MB/s for the raw fastq output. Gzipped fastq takes the same amount of time if you have pigz installed.

Usage:

reformat.sh in=reads.bam out=reads.fq primaryonly

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell16k

Hi Brian, thanks for suggesting BBmap. I am trying to use it but I have a weird problem and perhaps you can help me (hope this comment is the right place to ask).

We have paired-end reads, that have already been sorted by read name using sambamba, and I would like to create two fastq, one for each end. When I run the following command ./reformat.sh in=test.sorted.bam out1=test1.fq out2=test2.fq ow allowidenticalnames addslash primaryonly BBmap is telling me that "Input is being processed as paired" but a java.lang.AssertionError is thrown after the first pair of reads is processed (see below). When I ask for a single fastq with the following command ./reformat.sh in=test.sorted.bam out=test.fq ow allowidenticalnames addslash primaryonly BBmap is confirming that "Input is being processed as unpaired", the fastq is created and each read name is followed by " /1".

Any idea?


The thrown error is:

Input is being processed as paired
Exception in thread "Thread-2" java.lang.AssertionError: H2FF2ALXX:1:1:2:0  2   -1  +   32763188    32763337    1000000000000000010 1   0   1   TCTTTAAAACATTTTAAGGCATTTTACCTCTTAATGATAAGAATTTAAGAAAAACGCTAAAATATCATTATTCTGTCTATAAGATTAAAAGAGATTAATTAATCTAACATGTAGTACATATTAAATATAAATGAAATAACAAAAAAAAGC  A---FJJJJAJJJJJJ<-F-AAA-JJ7---AFJJF<J7JJF---JFFJ7JFA7<---<<--JJAJ-AFJ<J7-----7-<-<F-<-JF-F-7-7------7-----J-<-<-F7<--<---7F-F-------<---------AF7A----  .   23  .   .
H2FF2ALXX:1:1:2:0   163 chr16   32763189    1   115M35S =   32763344    304 TCTTTAAAACATTTTAAGGCATTTTACCTCTTAATGATAAGAATTTAAGAAAAACGCTAAAATATCATTATTCTGTCTATAAGATTAAAAGAGATTAATTAATCTAACATGTAGTACATATTAAATATAAATGAAATAACAAAAAAAAGC  A---FJJJJAJJJJJJ<-F-AAA-JJ7---AFJJF<J7JJF---JFFJ7JFA7<---<<--JJAJ-AFJ<J7-----7-<-<F-<-JF-F-7-7------7-----J-<-<-F7<--<---7F-F-------<---------AF7A----  SM:i:1  AS:i:1  RG:Z:0  NM:i:8  BC:Z:none
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:124)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:90)
at stream.SamReadInputStream.hasMore(SamReadInputStream.java:54)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:643)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by alesssia520
2

Reformat produces a file with interleaved reads by default. You can easily separate the two reads into two files by subsequently doing reformat.sh in=test.fq out1=test_R1.fq.gz out2=test_R2.fq.gz

ADD REPLYlink written 2.5 years ago by genomax67k

Thanks for clarifying!

ADD REPLYlink written 2.5 years ago by alesssia520
1

Yeah, Reformat will never interpret a sam/bam file as interleaved no matter what flags you give it, so it will just keep them in the input order and treat them as single-ended. If you want them in two files with the names changed, you can do it in a 2-stage pipe with renaming in the second phase, like this:

reformat.sh in=test.sorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon

"addcolon" is probably better than "addslash" as the " 1:" and " 2:" suffix are produced by more recent versions of Illumina software.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k
1

@Brian: addcolon is not working in v.36.59.

Exception in thread "main" java.lang.AssertionError: Unknown parameter addcolon
    at jgi.ReformatReads.<init>(ReformatReads.java:162)
    at jgi.ReformatReads.main(ReformatReads.java:42)
ADD REPLYlink written 2.5 years ago by genomax67k

Hmm, yeah, looks like I added it in 36.60. But 36.62 is up now.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

I must admit the conversion in BBmap is pretty fast! Does it support multi-threading? Will it be even faster if I increase the memory? Thanks!

ADD REPLYlink written 2.5 years ago by alesssia520

Reformat is multithreaded. Unfortunately, it's already going as fast as it can go, limited by the bam -> sam conversion of samtools. So, no, increasing memory won't help.

I wrote another program yesterday exclusively for sam/bam -> fastq conversion which is even faster (I clocked it at 109 Mbp/s, so around 50% faster than Reformat sam.gz -> fastq and 120% faster than bam -> fastq). But that's the speed for sam.gz -> fastq, so since you are starting with bam files, it wouldn't be any faster.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell16k

Too bad it does not work for bam->fastq :(

Anyway, starting from an (unsorted compressed) BAM file of 102G it took about 4.5h to sort the BAM according to read name with sambamba (using 60GB memory and 20 threads) and about 6h to convert the (uncompressed sorted) BAM of 310G in a single fastq and then to split it into two files (forward/reverse) with BBmap reformat. We have a few hundred of them, corresponding to few thousand hours of computation. Does anyone know any trick to speed it up? @Brian, do the BBmap performance decrease if the BAM is not sorted?

Thanks again!

ADD REPLYlink written 2.5 years ago by alesssia520
1

Hmmm... if you are converting to fastq.gz, Reformat will be faster if you have pigz installed. And since Reformat is low-memory and only uses a few threads, you can run multiple instances at once on a 20-core node (probably 5-6 instances).

The performance of Reformat's bam->fastq is limited by samtools. For compressed bam, position-sorted will be substantially faster than unsorted because the decompression should be faster. For uncompressed bam, I'm not really sure; it may not make a difference. You can run top and look at the CPU utilization; if samtools is at 100%, then Reformat can't go any faster no matter what you do.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

Hi @Brian! I've run the reformat on the unsorted compressed BAM with pigz installed and the processing time halved (from 2 to 1h, processing 264.59k reads/sec vs the previous 100.94k reads/sec). On the top of that, using the unsorted BAM saved me about 4.5h :) My current bottleneck is the second step of the BAM to fastq conversion, where the interleaved reads are separate into two files, that takes about 2.5h (122.25k reads/sec). Any way to speed this up? I know that I am being a bit insatiable here, but you never know :) Thanks!

ADD REPLYlink written 2.5 years ago by alesssia520

UPDATE: when used as input for BBmap reformat, the unsorted compressed BAM generated a non-interleaved fastq that is equivalent (in number of reads and size) to the one created by using the sorted uncompressed BAM. However, according to the script usage: "If input is paired and there is only one output file, it will be written interleaved.". What am I doing wrong?

Also, I suspect that the following command requires the input fastq to be interleaved to work properly, isn't it?

reformat.sh in=unsorted.test.fq out1=unsorted.test.end1.fq out2=unsorted.test.end2.fq

Thanks!

ADD REPLYlink written 2.5 years ago by alesssia520

That command does indeed require interleaved fastq input. As for bam -> fastq, Reformat will simply retain the original ordering and original reads (aside from secondary/supplimentary alignments, which are discarded with the "primaryonly" flag). It does not understand the ordering of the bam file, which is why you need to name-sort the bam file first, so that it will end up interleaved. In other words, the process is like this:

Step 1: Name-sort the bam.

Step 2: Run Reformat on the name-sorted bam, like this:

reformat.sh in=namesorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon

Did you do that, and encounter odd results?

ADD REPLYlink modified 2.5 years ago by genomax67k • written 2.5 years ago by Brian Bushnell16k

@Brian: Plain out=stdout and in=stdin seems to work fine. Is .fq extension needed?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

It's not strictly needed, because fastq is the default for Reformat. But I always like to include it because some other programs have other defaults. For example, "out=stdout" for BBMap will output sam format, and Dedupe will output fasta, I think. Whereas "out=stdout.fq" will always output fastq.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

Thanks Brian, I hoped I could avoid the time to name-sort the bam file first :)

ADD REPLYlink written 2.5 years ago by alesssia520

Incidentally, the reason I used the pipe in the command reformat.sh in=namesorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon) is to eliminate the need for two reformat steps.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell16k

Does anyone know any trick to speed it up?

Following may sound like a flippant solution but it works. We have done jobs that would have taken a couple of months on local clusters in a couple of days on Google compute.

If this is time sensitive do the compute on Google compute/Amazon AWS. You can start all jobs at the same time and be done in the time it takes to complete one using multiple VM's.

Otherwise get started with the conversion without wasting any additional time testing other tools :)

ADD REPLYlink written 2.5 years ago by genomax67k

:) I had already started the conversion before posting the initial question, but, you know, following BAMs can be converted faster ;)

Unfortunately, due to some legal/ethical constraints, using external machines is out of the question, so we must make the most of what we have at hand!

ADD REPLYlink written 2.5 years ago by alesssia520

Totally understand.

FYI: Google will sign business associate agreements (perhaps Amazon does too). So it is possible to create a business case (that can fit local security policies) for such use. Google compute is also HIPAA/EU Data protection directive compliant that covers many constraints.

Why do you need to convert the BAM's to fastq, out of curiosity?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

Hi Brian, so I tried this:

reformat.sh in=test.bam out=stdout.fq  | reformat.sh in=stdin.fq out1=r.1.fq.gz out2=r.2.fq.gz interleaved addcolon

however, I've been getting this error message:

Input is being processed as unpaired
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag and no ScafMap loaded.
This can normally be resolved by adding the flag ref=file, where file is the fasta file to which the reads were mapped.

is there any way to solve this? I'm using ver: BBMap_38.08 thanks.

ADD REPLYlink written 11 months ago by simplitia30
0
gravatar for Ron
2.5 years ago by
Ron930
United States
Ron930 wrote:

You can use HYDRA. Its pretty fast.

https://code.google.com/archive/p/hydra-sv/downloads

I have used this version Hydra.v0.5.3.tar.gz

Something like this should work:

Hydra-Version-0.5.3/bin/bamToFastq -bam  input.bam -fq1 out.R1.fq  -fq2 out.R2.fq
ADD COMMENTlink written 2.5 years ago by Ron930

If you say it is pretty fast... I will give it a try! Thanks!

ADD REPLYlink written 2.5 years ago by alesssia520
1

Hi again Ron! Looking at the number of bytes of the fastq file written in the same time interval (that is a quite rough way to look into performances) by both bedtools and HYDRA I would say that their speed is quite comparable, the latter being slightly but consistently slower.

ADD REPLYlink written 2.5 years ago by alesssia520
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1976 users visited in the last hour