I'm currently pre-processing some whole-genome and exome sequencing data and could use some tips for my pipeline. I use GATK for most of the steps and have looked at their recommended Best Practices, but I'm still not sure about a few things.
My current order of steps is:
1. [Picard]: Merge BAM files if the sample has been run on several lanes.
2. [GATK]: Realign around indels using RealignerTargetCreator followed by IndelRealigner. These walkers are supplied with VCFs containing known indels, provided in the GATK bundle.
3. [GATK]: Recalibrate base quality scores using BaseRecalibrator followed by PrintReads with the -BQSR option. BaseRecalibrator is supplied with a VCF containing known sites, provided in the GATK bundle.
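For concreteness, the steps above correspond roughly to the command lines below. This is only a sketch in Python that builds the commands as argument lists and does not run anything. The jar paths and every file name are placeholders, and the flags assume GATK 3.x and a Picard release that ships a single picard.jar, so check them against the versions you have installed.

```python
# Sketch only: builds the command lines for the three steps, nothing is executed.
# Jar paths and file names are placeholders; flags assume GATK 3.x syntax.

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]
PICARD = ["java", "-jar", "picard.jar"]


def merge_cmd(lane_bams, merged_bam):
    """Step 1: merge the per-lane BAMs of one sample with Picard MergeSamFiles."""
    cmd = PICARD + ["MergeSamFiles"]
    cmd += ["I=" + bam for bam in lane_bams]
    return cmd + ["O=" + merged_bam]


def realign_cmds(ref, bam, known_indels, intervals, out_bam):
    """Step 2: RealignerTargetCreator followed by IndelRealigner."""
    known = []
    for vcf in known_indels:
        known += ["-known", vcf]  # one -known argument per known-indel VCF
    targets = (GATK + ["-T", "RealignerTargetCreator", "-R", ref, "-I", bam]
               + known + ["-o", intervals])
    realign = (GATK + ["-T", "IndelRealigner", "-R", ref, "-I", bam,
                       "-targetIntervals", intervals]
               + known + ["-o", out_bam])
    return [targets, realign]


def bqsr_cmds(ref, bam, known_sites, table, out_bam):
    """Step 3: BaseRecalibrator followed by PrintReads with -BQSR."""
    recal = GATK + ["-T", "BaseRecalibrator", "-R", ref, "-I", bam,
                    "-knownSites", known_sites, "-o", table]
    apply_recal = GATK + ["-T", "PrintReads", "-R", ref, "-I", bam,
                          "-BQSR", table, "-o", out_bam]
    return [recal, apply_recal]


def build_pipeline():
    """Assemble all commands in order, using placeholder file names."""
    cmds = [merge_cmd(["lane1.bam", "lane2.bam"], "merged.bam")]
    cmds += realign_cmds("ref.fasta", "merged.bam", ["known_indels.vcf"],
                         "targets.intervals", "realigned.bam")
    cmds += bqsr_cmds("ref.fasta", "realigned.bam", "known_sites.vcf",
                      "recal.table", "recal.bam")
    return cmds


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(" ".join(cmd))
```

Keeping the commands as lists makes it easy to hand each one to `subprocess.run` later, once the paths point at real files.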
In step 2, I use the following filters:
- MappingQualityFilter -mmq 40 (Require mapping quality 40 or higher)
What I'm unsure of is whether I should filter that heavily in step 2, since the recalibrated base qualities affect the mapping quality (right?). I'm thinking it might be better to skip the filters altogether in step 2 and instead filter in step 3, after the base recalibration is done. That way, the filtering would be based on more accurate scores and the end result should be more reliable.
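To make the two options concrete, here is a small sketch of where the filter would sit in each case. It only builds the GATK command lines and runs nothing. All file names are placeholders, the `-rf MappingQuality -mmq 40` spelling is my understanding of how GATK 3.x takes this read filter, and attaching it to PrintReads is just one way to read "filter in step 3".

```python
# Sketch only: compares two placements of the mapping-quality read filter.
# File names are placeholders; "-rf MappingQuality -mmq 40" assumes GATK 3.x.

MAPQ_FILTER = ["-rf", "MappingQuality", "-mmq", "40"]


def gatk(tool, *args, read_filter=False):
    """Build one GATK command line, optionally appending the MAPQ filter."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", tool,
           "-R", "ref.fasta", *args]
    return cmd + MAPQ_FILTER if read_filter else cmd


def build(filter_in_step):
    """Return the GATK commands with the filter in step 2 or in step 3."""
    in2 = filter_in_step == 2  # current setup: filter during realignment
    in3 = filter_in_step == 3  # alternative: filter when writing recalibrated reads
    return [
        gatk("RealignerTargetCreator", "-I", "merged.bam",
             "-known", "known_indels.vcf", "-o", "targets.intervals",
             read_filter=in2),
        gatk("IndelRealigner", "-I", "merged.bam",
             "-targetIntervals", "targets.intervals",
             "-known", "known_indels.vcf", "-o", "realigned.bam",
             read_filter=in2),
        gatk("BaseRecalibrator", "-I", "realigned.bam",
             "-knownSites", "known_sites.vcf", "-o", "recal.table"),
        gatk("PrintReads", "-I", "realigned.bam", "-BQSR", "recal.table",
             "-o", "recal.bam", read_filter=in3),
    ]


if __name__ == "__main__":
    for step in (2, 3):
        print("# filter in step", step)
        for cmd in build(step):
            print(" ".join(cmd))
```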
What do you guys think? Any tips are greatly appreciated!