Merge BAM files from (totally) different references
9 months ago
predeus ★ 1.9k

Hi all,

I was wondering if there is a tool that would allow merging BAM files from totally different references. (In case you wonder how this could possibly be useful: if you have a large reference that takes a long time to build, splitting it into e.g. 16 smaller chunks speeds things up dramatically.)

The way I see it, 3 things need to be adjusted: 1) the header, which is trivial; 2) flags and mapping qualities (a single mapper can become a multi-mapper); 3) additional tags that track multi-mapping, like NH, HI, etc. The latter two are fairly hard to do effectively, so if a known tool exists, it would be nice to give it a try.
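
For the header part, something along these lines is what I have in mind - a hypothetical pysam sketch with placeholder file names, not tested:

    import pysam

    # Combine the @SQ lines of the per-chunk BAMs into a single header.
    # Assumes contig names do not clash between chunks.
    chunk_bams = ["chunk01.bam", "chunk02.bam"]   # placeholder names

    sq_lines, seen = [], set()
    for path in chunk_bams:
        with pysam.AlignmentFile(path, "rb") as bam:
            for sq in bam.header.to_dict().get("SQ", []):
                if sq["SN"] not in seen:
                    seen.add(sq["SN"])
                    sq_lines.append(sq)

    merged_header = pysam.AlignmentHeader.from_dict(
        {"HD": {"VN": "1.6", "SO": "unknown"}, "SQ": sq_lines}
    )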

Any suggestions would be appreciated, as always.

-- Alex

bam ngs mapping • 1.5k views

2) flags and mapping qualities (a single mapper can become a multi-mapper); 3) additional tags that track multi-mapping, like NH, HI,

Is the same dataset aligned to two references? If not, there would be no way to do this based on the alignments in the files. Even if the same dataset is aligned to two references, it would be a stretch to fiddle with the alignments to do the above.

It may be best to bite the bullet and re-do the alignments with the merged reference.


It's the exact same dataset, yes. The problem is, this is a part of a pipeline, and the reference is variable (e.g. 2-3k bacterial genomes). Indexing and mapping separately would have been wonderful, but the merging part seems complicated.


Ouch, so this is 2-3K separate BAM files? Since these genomes will share short sequences, the multi-mapping situation will get out of hand quickly if you want to merge all these files.


I was thinking ~16 chunks of a few hundred genomes each - this usually works nice and smooth. So, only 16 BAM files.


If you split the reads into chunks and aligned them to the same reference genome, that would be fine.

But if my understanding is correct, you aligned the same reads to multiple reference genomes. How are you going to deal with the fact that the same read could have a primary alignment in multiple different genomes? Are you going to pick one based on mapping quality, randomly break ties, and manually modify the MAPQ?
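
Just to spell out what that "fiddling" would look like - a hypothetical pysam-style sketch (placeholder helper, untested):

    import pysam  # the records below would be pysam.AlignedSegment objects

    # Hypothetical helper: given all records for ONE read (gathered from the
    # per-chunk BAMs, e.g. after name-sorting them), keep the best-MAPQ record
    # as the primary alignment and demote the rest to secondary.
    def demote_extra_primaries(records):
        records.sort(key=lambda r: r.mapping_quality, reverse=True)
        for rec in records[1:]:
            rec.is_secondary = True     # sets the 0x100 flag
            rec.mapping_quality = 0     # MAPQ from a partial reference is not meaningful
        return records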


I need to do read counting with featureCounts afterwards, so, as far as I can tell, all I need is the correct NH tag.
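
Roughly what I have in mind, as a hypothetical, untested pysam sketch (single-end reads, placeholder file names):

    import pysam
    from collections import Counter

    chunk_bams = ["chunk01.bam", "chunk02.bam"]   # placeholder names

    # Pass 1: count how many alignments each read has across all chunks.
    hits = Counter()
    for path in chunk_bams:
        with pysam.AlignmentFile(path, "rb") as bam:
            for read in bam:
                if not read.is_unmapped:
                    hits[read.query_name] += 1

    # Merged header from all chunks' @SQ lines (contig names must not clash).
    sq, seen = [], set()
    for path in chunk_bams:
        with pysam.AlignmentFile(path, "rb") as bam:
            for line in bam.header.to_dict().get("SQ", []):
                if line["SN"] not in seen:
                    seen.add(line["SN"])
                    sq.append(line)
    header = pysam.AlignmentHeader.from_dict({"HD": {"VN": "1.6"}, "SQ": sq})

    # Pass 2: rewrite NH and re-parse each record against the merged header,
    # so placement is resolved by contig name rather than numeric reference ID.
    with pysam.AlignmentFile("merged_nh_fixed.bam", "wb", header=header) as out:
        for path in chunk_bams:
            with pysam.AlignmentFile(path, "rb") as bam:
                for read in bam:
                    if read.is_unmapped:
                        continue
                    read.set_tag("NH", hits[read.query_name], value_type="i")
                    out.write(pysam.AlignedSegment.fromstring(read.to_string(), header))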


Since you mention featureCounts: it is not going to count multi-mapping reads by default. What kind of counting are you planning to do? Are you planning to count the multi-mappers (with the -M switch)? Which aligner are you using, and does it mark multi-mappers in hundreds of locations? (bbmap.sh with ambig=all explicitly does that, but it may need to be tested with hundreds of genomes.)


Yep, I'm using hisat2 with heavily modified presets, and -M -O --fraction for featureCounts. It's reporting alignments in many (up to 2k in my case) locations, and I even had to re-compile featureCounts to stop it from rounding the counts, because anything less than 1/100th was regarded as 0. All of that works OK, I just need it to be faster.
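
(To put numbers on the rounding issue - assuming two-decimal output, which is how I read it:)

    # With featureCounts -M --fraction each reported location gets 1/NH of a read.
    nh = 2000                      # a read hitting 2000 locations
    per_location = 1.0 / nh        # 0.0005
    print(round(per_location, 2))  # 0.0 - the whole read vanishes once rounded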

9 months ago

I think the whole approach is invalid. You're withholding information from the mappers, which will completely screw everything up: MQ, mapping positions, number of mapped reads - actually everything. Don't do this (!).

To optimize your pipeline, can't you use a tool which doesn't need to build a reference (minimap2 maps all read types and fits the bill), and also subdivide your fastqs to improve mapping speed and later merge the BAMs? Nextflow has some nice methods to subdivide fastqs if you're using it.
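
If you go the split-fastq route, the merge afterwards is the easy part - e.g. via pysam's samtools wrappers (hypothetical file names, coordinate-sorted chunk BAMs assumed):

    import pysam

    # Same reference for every chunk of reads, so a plain samtools merge works.
    pysam.merge("-f", "merged.bam", "part1.bam", "part2.bam", "part3.bam")
    pysam.index("merged.bam")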

If you really want to optimize mapping speed per core, look at hyperfine on GitHub - it's an excellent benchmarking tool. TL;DR: use 16-20 cores for mapping processes (at least on our hardware); the ROI decreases markedly after this.


Why would it screw up _everything_? I don't see how mapping positions or the number of mapped reads (to a particular part of the reference) would change just because the reference is processed in a split way.

Mapping quality and multi-mapping would indeed be messed up. However, I am not sure I need that information for the downstream processing.

Splitting the fastq is an obvious solution; however, it really is not what I need: the problem I have is that the reference needs to be generated anew for every run, and that takes quite a long time since indexing is always single-core. Parallel scaling in hisat2 (this is the mapper I'm using) is quite good from what I remember, so I don't think splitting the reads is necessary.

Minimap2 does seem like an interesting option to try, thank you for the suggestion.


Yep, mapping positions will change, because a small(?) fraction of reads from genome C will be mapped to genome A or genome B if genome C is present in the sample reads, but not in the reference.

You can try this easily with a couple of test experiments using different numbers of genomes in your reference. If you plan on a 16-fold reference split, have one complete reference and one which is just 1/16 of the size, since this is what you are proposing. Results will be dramatically different.

Why do you need to generate the ref anew for every run? If it is the same reference, can you not just copy/link it in from another location?


I'm not using the mode where only the best alignment is reported - I'm using an approximation of -a mode (-k 10000), the way it's implemented in bowtie2 and hisat2. I doubt the results would be very different in this mode - basically, if something passes the reporting threshold, it should be reported. If anything, mapping to a split reference should be a tiny bit more sensitive, if I understand the algorithm correctly.

Why do you need to generate the ref anew for every run? If it is the same reference, can you not just copy/link it in from another location?

Because if I make the full reference, it won't fit into any memory. No, it's not the same reference; it's a new reference for every sample.
