Question: How about my parallel bwa+gatk pipeline?
1
gravatar for bluemonster0808
12 months ago by
China
bluemonster080840 wrote:

Normally, we run NGS analysis in the follwing steps:

1、

~/bwa-0.7.16a/bwa mem -t 32 -M -R @RG\tID:1\tPL:Illumina\tSM:HCC1143 ~/reference/human_g1k_v37_decoy.fasta  1.fastq 2.fastq | samblaster --splitterFile my.split.sam --discordantFile 
 my.discord.sam > my.sam

2、

~/sambamba_v0.6.6 view -S -f bam -l 0 my.sam | ~/sambamba_v0.6.6 sort -t 32 -m 60G --tmpdir /data/ -o my.sort.bam /dev/stdin

~/sambamba_v0.6.6 index  my.sort.bam

3、

~/sambamba_v0.6.6 markdup --tmpdir /data/ --overflow-list-size 800000 --remove-duplicates my.sort.bam my.nodup.bam

~/sambamba_v0.6.6 index my.nodup.bam

4、

java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type BaseRecalibrator -nct 32 --out my.recal.grp --disable_indel_quals --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --knownSites ~/reference/dbsnp_137.b37.vcf --knownSites ~/reference/1000G_phase1.indels.b37.vcf --knownSites ~/reference/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -L 20

java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type PrintReads -nct 32 --out my.base_recalibrated.bam --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --BQSR  my.recal.grp

5、

java -Xmx60G -jar -Djava.io.tmpdir=/data/ ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/reference/human_g1k_v37_decoy.fasta -I my.base_recalibrated.bam --genotyping_mode DISCOVERY -o my_variants.vcf

The five steps above cost 60+ hours on a 32Core 128G machine, for dataset NA12878-Garvan-Vial1_R1.fastq.gz and NA12878-Garvan-Vial1_R2.fastq.gz. I build a parallel pipeline to speed up, and it indeed reduced the timecost to 15hours. And can anybody tells me whether this pipeline is ok or not?

1、 pigz -d *.fastq.gz

2、 split the 2 fastq files into 8 parts seperately. say, NA12878-Garvan-Vial1_R1_0.fastq~NA12878-Garvan-Vial1_R1_7.fastq, NA12878-Garvan-Vial1_R2_0.fastq~NA12878-Garvan-Vial1_R2_7.fastq, each part has almost the same reads(the same text line numbers).

3、 bwa mem |samblaster on 8 virtual machines simultaneously.

4、 use sambamba to get 8 sorted bam on 8 virtual machines simultaneously.

5、 use "bamtools split -reference" to split the 8 sorted bam, and then use "samtools merge" to get serveral bam files, each one is a chromosome. say, REF_1.bam, REF_2.bam ..... REF_22.bam, REF_X.bam, REF_Y.bam, REF_MT.bam. and also many other like REF_GL000249.1.bam

6、 start 8 virtual machines, and each one to markdup some chromosomes' bam.

7、 start 8 virtual machines, and each one to do BQSR on some chromosomes's bam.

8、 start 8 virtual machines, and each one to do HaplotypeCaller on some chromosomes's bam.

pipeline parallel ngs gatk • 811 views
ADD COMMENTlink modified 12 months ago by JJ370 • written 12 months ago by bluemonster080840
1
gravatar for chen
12 months ago by
chen1.7k
OpenGene
chen1.7k wrote:

Your bwa is not in parallel, which is one of the slowest steps. This will be your bottleneck.

Suggest you to split FASTQ first instead of splitting bam files. You can use tools like fastp or slicer to split your fastq files by lines.

ADD COMMENTlink modified 12 months ago • written 12 months ago by chen1.7k

also you have to judge for yourself whether to do base recalibration or not (I do it for now but I only see a negligible effect) ... there was a discussion if you really need it.

ADD REPLYlink written 12 months ago by JJ370

I did split FASTQ files in step 2 "split the 2 fastq files into 8 parts seperately".

And then I call 8 "bwa mem" on 8 virutal machines parallelly.

"Split bam file and merge bam file" is to generate each chromosome's bam file.

ADD REPLYlink written 12 months ago by bluemonster080840
0
gravatar for JJ
12 months ago by
JJ370
JJ370 wrote:

I can only speak of my own experience and bwa is by far not the slowest step (but parallise what you can :) ) - it's GATK base recalibration and HC.

Have a look here - you could use the -nct option also for the HC - I use it and I never had a crash... (GATK says you should use it with caution as crashes can occur)

ADD COMMENTlink modified 12 months ago • written 12 months ago by JJ370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1280 users visited in the last hour