Speeding up WGS analysis
8 months ago
Trivas ★ 1.8k

I'm working with WGS data for the first time to identify the location of large INDELs. I did some searching and came up with a general pipeline. My question is: are there newer or better tools to speed this up, either through parallelization or faster algorithms? Are there any Nextflow pipelines for this (I didn't see any on nf-core)? With each FASTQ file at 100 GB, even running FastQC took longer than I expected.

Pipeline:

  • FastQC
  • Trim with bbduk
  • Map with bbmap
  • Use samtools to extract the unmapped reads
  • Second bbmap step to map the unmapped reads against the large insertion
  • Use samtools flags to identify the insertion locus
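One way to read the last step, sketched under assumptions: the insertion locus is hinted at by reads that map to the reference while their mate does not. In the SAM specification, flag bit 0x4 marks an unmapped read and 0x8 an unmapped mate. The function name `anchor_reads` is hypothetical, and the loop below works on plain SAM text only to make the bit logic explicit; in practice `samtools view -f 8 -F 4 aligned.bam` applies the same filter directly to a BAM file.

```shell
# Print "chrom<TAB>pos<TAB>read_name" for mapped reads whose mate is unmapped.
# Input: SAM text on stdin (header lines are skipped).
# Flag bits from the SAM specification: 0x4 = read unmapped, 0x8 = mate unmapped.
anchor_reads() {
    tab=$(printf '\t')
    while IFS="$tab" read -r qname flag rname pos rest; do
        # Skip SAM header lines, which start with '@'.
        case "$qname" in
            @*) continue ;;
        esac
        # Keep records where the read itself is mapped but its mate is not.
        if [ $(( $flag & 4 )) -eq 0 ] && [ $(( $flag & 8 )) -ne 0 ]; then
            printf '%s\t%s\t%s\n' "$rname" "$pos" "$qname"
        fi
    done
}
```

Typical use would be something like `samtools view aligned.bam | anchor_reads`, where `aligned.bam` is a placeholder for the output of the first mapping step.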
WGS • 862 views

"When each FASTQ file is 100 GB"

That's ... gigantic. Is that uncompressed?


That's compressed and per read file, so ~200 GB in total for the paired-end data. This is human genome data at >=50X coverage.


Shouldn't you use an established, benchmarked pipeline for indel detection? What you are doing is intuitive, but how do you assess its accuracy? There are also dedicated callers for large SVs; have you considered them? Just curious.


Please point me to one! I've been looking on and off for a while and haven't found an established pipeline. The steps I found are largely from other Biostars posts such as "Trimming adapter sequences - is it necessary?" and "Identification of the sequence insertion site in the genome".

I've also found https://github.com/kensung-lab/INSurVeyor, which could be useful but operates downstream of read mapping, and https://bioinfo.uth.edu/VirusFinder/, which would mean co-opting a viral integration site pipeline.


@Brian had posted a way to identify insertions here: Identification of the sequence insertion site in the genome

You could give it a try.

8 months ago
GenoMax 147k

I like that you are already using BBMap :-)

You could pipe the trimming and mapping into one step and write the unmapped reads at the same time. Untested, but you can try this (the reads are temporarily interleaved while they are passed to bbmap):

bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq interleaved=t out=aligned.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz

You could also try aligning the unmapped reads from step 1 directly to the insertion sequence in the same pipe (not sure if it will work, you will have to try):

bbduk.sh -Xmx6g in1=file.R1.fastq.gz in2=file.R2.fastq.gz out=stdout.fastq (bbduk options) | bbmap.sh -XmxNNg (options for bbmap) in=stdin.fastq interleaved=t out=aligned.bam outu=stdout.fastq | bbmap.sh -XmxNNg (options for second alignment) in=stdin.fastq interleaved=t out=aligned_insertion.bam outu1=unaligned.R1.fastq.gz outu2=unaligned.R2.fastq.gz

I assume you are already using multiple threads with BBMap; that would be the best option.

While you could look for other tools (DRAGEN is likely going to be the fastest, if you can use an FPGA), the gains may not outweigh the effort of building a new pipeline.

All the steps above could be brute-force parallelized, if you have the hardware, by splitting the input into chunks of a fixed number of reads and then merging the resulting files (even the per-chunk FastQC reports could be combined using MultiQC).
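A minimal sketch of the splitting step, assuming standard four-line FASTQ records (no wrapped sequence lines). The function name `split_fastq` and the chunk prefix are hypothetical; only POSIX `split` is used.

```shell
# Split a FASTQ stream (stdin) into chunks holding a fixed number of reads.
# Assumes every record is exactly 4 lines, so chunk boundaries never cut a record.
# Usage: split_fastq READS_PER_CHUNK OUTPUT_PREFIX < reads.fastq
split_fastq() {
    reads_per_chunk=$1
    prefix=$2
    # -l: lines per chunk; -a 3: three-letter suffixes (aaa, aab, ...); -: read stdin.
    split -l $(( reads_per_chunk * 4 )) -a 3 - "$prefix"
}
```

For compressed paired-end input this might be run as `gzip -dc file.R1.fastq.gz | split_fastq 1000000 chunks/R1.` and again for R2 with the same chunk size, so that matching chunks hold the same read pairs as long as both files are in the same order. The per-chunk BAM files could then be combined with `samtools merge`.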


Thanks! Good to know I can grab the unmapped reads in the bbmap step.

Genuine question: why does everyone seem to prefer BBMap for WGS alignment? Coming from the RNA-seq world, the aligner doesn't seem to matter much.


BBMap is as performant as any aligner out there. Unfortunately, since it is not published, it does not seem to get as much attention as STAR and others. The only thing it lacks is the ability to create transcriptome-mapped BAM files when aligning to a genome (STAR does this).
