I like that you are already using BBMap :-)
You could pipe the trimming/mapping in one step and write unmapped reads at the same time. Untested but you can try (this interleaves the reads temporarily while they are passed to bbmap)
bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq out=aligned.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz
You could also try aligning the unmapped reads from step 1 directly to insertion (not sure if it will work, you will have to try)
bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq out=aligned.bam outu=unaligned.fastq | bbmap.sh -XmxNNg (options for second alignmant) in=stdin.fastq out=aligned_insertion.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz
I assume you are using multiple threads with BBMap and that would be best option.
While you could look for other tools (DRAGEN is likely going to be the fastest, if you can use an FPGA) the gains may outweigh having to work on a new pipeline.
All the steps above could be brute-force parallelized (even FastQC could be combined using MultiQC), if you have the hardware, by splitting input into chunks of whatever number of reads and then merging the resulting files.
That's ... gigantic. Is that uncompressed?
That's compressed and per read; so total ~200gb paired end. This is human genome data at >=50X coverage.
Shouldn't you use an established pipeline for detection of indels that at least has been benchmarked? What you do is intuitive, but how do you know about accuracy? There are large SV callers out there, what about them, just curious?
Please point me to one! I've been looking on and off for a while and haven't found an established pipeline. The steps I found are largely from other biostars posts such as Trimming adapter sequences - is it necessary? and Identification of the sequence insertion site in the genome.
I've also found https://github.com/kensung-lab/INSurVeyor which could be useful but is downstream from read mapping or https://bioinfo.uth.edu/VirusFinder/ which would co-opt some viral integration site pipelines.
@Brian had posted a way to identify insertions here: Identification of the sequence insertion site in the genome
You could give it a try.