Question

workflow for variant calling using GATK in multiple paired end reads

1

Entering edit mode

7.4 years ago

reza ▴ 300

I have four fastq file that correspond two paired end lane (lane1: L1R1.fastq, L1R2.fastq and lane2:L2R1,L2R2). Mapping to reference performed using below command via “bwa mem”.

./bwa mem –M –t 4 ref.fasta L1R1.fq L1R2.fq > D1.sam

./bwa mem –M –t 4 ref.fasta L2R1.fq L2R2.fq > D2.sam

Now I have 2 sam files and my final goal is variant calling (SNP, indel) and variant annotation for my non model organism.

I want to convert sam to bam (and sorting) using Picard or samtools. Which mentioned programs do you recommend? Sorting is need in this step?
I want to define read group for two bam files separately (using Picard) and then merg them to one bam file (big.bam).
sorting big.bam file. Sorting is need if I sorted two bam files (before merging) in step 1?
marking duplicates using Picard tool.
building bam index and then Create Realignment Targets using GATK and finally variant calling.

Mentioned workflow is standard and correct way to reaching to the aim? in my workflow, definition Read Group and merging bam files done in right steps? I read the workflow https://gencore.bio.nyu.edu/variant-calling-pipeline/, but it worked with one bam file without need to definition Read Group.

Additional questions : D1.sam (147 G) and D2 (145 G) are big files and merging them will create a very large file that handling it is hard in my opinion. GATK and Picard can handle it in 32G RAM computer?

snp next-gen • 3.7k views

ADD COMMENT • link updated 6.1 years ago by 570932004 ▴ 10 • written 7.4 years ago by reza ▴ 300

0

Entering edit mode

Alternatively, you can merge the fastq files per read direction. Try to work with bam files instead of sam files, those take far less space.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

i added two different RG to 2 bam files correspondin to one sample and then merged them in one bam file (Using Picard), but size of merged bam file (70 G) is less than sum of two original bam files (bam1= 48 G and bam2= 49 G). why? everything is right?

ADD REPLY • link 7.4 years ago by reza ▴ 300

0

Entering edit mode

That's possible. BAM is compressed, merging bams with similar sequences can lead to better compression, especially if sorted.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2018-03-26

Hi, you workflow seems OK. Here are my suggestions on steps 1 through 3 :

Both samtools and picard can sort sam and output bam in one run. GATK recommends the use of picard tools for data pre-processing to produce analysis-ready bam files. Sorting is essential to keeping memory consumption low during downstream analysis.
picard AddOrReplaceReadGroups is what you need. Merging bam is an optional step since GATK can accept merged or seperate bam files, which is why it needs reads groups to distinguish reads from different samples. Many other variant-calling tools however see each bam file as one sample, making merge a NOT-to-do process.
If I remember correctly, merging sorted bams will automatically produce a sorted big.bam. You could use samtools view to check to sam-header in big.bam just to be sure.

picard is not a memory-hungry program, so won't be a problem. GATK may experience high memory usage in certain cases but typically 32G ram is enough.

Finally , I'd recommend you read the full docs from GATK, specifically the "Best Practices", which has very detailed explanation for each step. I say this because bwa has an option "-R" to add RG on-the-fly, so you don't need to go through the whole sam file again just to add RG. The example you mentioned reading did this and I think you may have missed it. Also, there are extra steps which are specific to certain cases (e.g SplitNCigarReads for RNA-seq data or Base Quality Score Recalibration for model species) which may or may not improve your final variant quality. Spending an extra 30 minutes before conducting the analysis may save you hours if not days of troubleshoot and runtime.