Question

Speeding up WGS analysis


8 months ago

Trivas ★ 1.8k

I'm working with WGS data for the first time to identify the location of large INDELs. I did some searching and came up with a general pipeline. My question is: are there newer or better tools to speed this up, either through parallelization or simply faster algorithms? Are there any Nextflow pipelines for this (I didn't see any on nf-core)? With each FASTQ file at 100 GB, even running FastQC took longer than I expected.

Pipeline:

FASTQC
Trim with bbduk
Map with bbmap
Use samtools to take unmapped reads
2nd bbmap step to map against large insertion
Samtools flags to identify the insertion locus
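
For the last step, a minimal sketch of how the relevant SAM flag bits can be read, assuming the reads of interest are pairs where one mate maps to the insertion and the other does not. The bit values 4 (read unmapped) and 8 (mate unmapped) come from the SAM specification; the function names and status labels are hypothetical and only for illustration.

```shell
#!/usr/bin/env bash
# SAM FLAG bits (from the SAM specification):
#   4 (0x4) = this read is unmapped
#   8 (0x8) = the mate is unmapped

flag_has() {
    # usage: flag_has <flag> <bit>; succeeds if <bit> is set in <flag>
    [ $(( $1 & $2 )) -ne 0 ]
}

pair_status() {
    # usage: pair_status <flag>; prints a label (labels are hypothetical)
    local flag=$1
    if flag_has "$flag" 4 && flag_has "$flag" 8; then
        echo both_unmapped
    elif flag_has "$flag" 4; then
        echo read_unmapped_mate_mapped
    elif flag_has "$flag" 8; then
        echo read_mapped_mate_unmapped
    else
        echo both_mapped
    fi
}
```

With samtools, the same selection is made with -f (require bits) and -F (exclude bits); for example, samtools view -f 8 -F 4 aligned_insertion.bam would keep mapped reads whose mate is unmapped.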

WGS • 862 views

ADD COMMENT • link updated 8 months ago by GenoMax 147k • written 8 months ago by Trivas ★ 1.8k


When each FASTQ file is 100 GB

That's ... gigantic. Is that uncompressed?

ADD REPLY • link 8 months ago by Ram 44k


That's compressed, and per read file, so ~200 GB in total for the paired-end data. This is human genome data at >=50X coverage.

ADD REPLY • link 8 months ago by Trivas ★ 1.8k


Shouldn't you use an established pipeline for indel detection, one that has at least been benchmarked? What you are doing is intuitive, but how do you know how accurate it is? Just curious: there are callers for large SVs out there, so why not use one of them?

ADD REPLY • link 8 months ago by ATpoint 85k


Please point me to one! I've been looking on and off for a while and haven't found an established pipeline. The steps I found are largely from other Biostars posts, such as "Trimming adapter sequences - is it necessary?" and "Identification of the sequence insertion site in the genome".

I've also found https://github.com/kensung-lab/INSurVeyor, which could be useful but sits downstream of read mapping, and https://bioinfo.uth.edu/VirusFinder/, which would mean co-opting a viral integration site pipeline.

ADD REPLY • link 8 months ago by Trivas ★ 1.8k


@Brian had posted a way to identify insertions here: "Identification of the sequence insertion site in the genome"

You could give it a try.

ADD REPLY • link 8 months ago by GenoMax 147k

score 0 · Answer 1 · 2024-02-20

I like that you are already using BBMap :-)

You could pipe the trimming into the mapping as one step and write out the unmapped reads at the same time. This is untested, but you can try it (the reads are temporarily interleaved while they are passed to bbmap, so bbmap is told to treat stdin as interleaved):

bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq interleaved=t out=aligned.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz

You could also try aligning the unmapped reads from step 1 directly to the insertion sequence in the same pipe by sending them to stdout instead of a file (not sure if it will work, you will have to try):

bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq interleaved=t out=aligned.bam outu=stdout.fastq | bbmap.sh -XmxNNg (options for second alignment) in=stdin.fastq interleaved=t out=aligned_insertion.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz

I assume you are already using multiple threads with BBMap; that would be the best option.

While you could look for other tools (DRAGEN is likely going to be the fastest, if you can use an FPGA), the gains may not outweigh the effort of building a new pipeline.

All of the steps above could also be brute-force parallelized, if you have the hardware: split the input into chunks of some fixed number of reads, process the chunks in parallel, and then merge the resulting files. Even the per-chunk FastQC reports could be combined using MultiQC.
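
A minimal sketch of the split-and-merge idea, assuming standard 4-line FASTQ records (no wrapped sequence lines) and only gzip and POSIX split/cat. The function names split_fastq and merge_chunks are hypothetical helpers, not part of any tool mentioned here.

```shell
#!/usr/bin/env bash
# Split a FASTQ file into chunks of a fixed number of reads, and merge
# text chunks back together in order. Each read occupies exactly 4 lines,
# so the chunk size in lines is reads_per_chunk * 4.

split_fastq() {
    # usage: split_fastq <input.fastq[.gz]> <reads_per_chunk> <output_prefix>
    local input=$1 reads_per_chunk=$2 prefix=$3
    local lines=$(( reads_per_chunk * 4 ))
    # Decompress on the fly if needed, then cut on record boundaries.
    # Chunks are named <prefix>aaaa, <prefix>aaab, ... (split's default).
    case "$input" in
        *.gz) gzip -dc -- "$input" ;;
        *)    cat -- "$input" ;;
    esac | split -l "$lines" -a 4 - "$prefix"
}

merge_chunks() {
    # usage: merge_chunks <output_prefix> > merged.fastq
    # The glob expands in lexicographic order, which matches split's naming.
    cat -- "$1"*
}
```

If R1 and R2 are split with the same reads_per_chunk, chunk N of R1 should pair with chunk N of R2, provided both files list reads in the same order. Plain concatenation only applies to text outputs such as FASTQ; per-chunk BAM files would need something like samtools merge instead.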