Question: Advice On Pre-Processing/Filtering NGS Data
7.9 years ago by
DonJoe30 wrote:


I'm currently working on pre-processing some whole genome and exome sequencing data and could use some tips for my current pipeline. I use GATK for most of the steps and have looked at their recommended best practice but I'm still not sure about some things.

The order I'm doing stuff is now:

  1. [Picard]: Merge BAM files if the sample has been run on several lanes.

  2. [GATK]: Re-align around indels using RealignerTargetCreator followed by IndelRealigner. These walkers are supplied with VCFs containing known indels, provided in the GATK bundle.

  3. [GATK]: Recalibrate base quality scores using BaseRecalibrator followed by PrintReads with the -BQSR option. BaseRecalibrator is supplied with a VCF containing known sites, provided in the GATK bundle.
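For reference, the three steps above could be scripted roughly as follows. This is only a sketch, assuming a GATK 3.x jar and a single-jar Picard release; the jar paths, the file names and the helper names `gatk_cmd` and `build_pipeline` are made up for illustration. It only builds the command lines and does not run anything.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: path to a GATK 3.x jar
PICARD_JAR = "picard.jar"          # assumption: single-jar Picard release


def gatk_cmd(walker: str, ref: str, in_bam: str, extra: List[str]) -> List[str]:
    """Build the argument list for one GATK 3.x walker invocation."""
    return ["java", "-jar", GATK_JAR, "-T", walker, "-R", ref, "-I", in_bam] + extra


def build_pipeline(ref: str, lane_bams: List[str], known_indels: str,
                   known_sites: str, prefix: str) -> List[List[str]]:
    """Return the commands for steps 1-3 in order (hypothetical file naming)."""
    cmds: List[List[str]] = []

    # Step 1: merge only if the sample was run on several lanes.
    if len(lane_bams) > 1:
        merged = f"{prefix}.merged.bam"
        cmds.append(["java", "-jar", PICARD_JAR, "MergeSamFiles"]
                    + [f"I={bam}" for bam in lane_bams]
                    + [f"O={merged}"])
    else:
        merged = lane_bams[0]

    # Step 2: find realignment targets, then realign around indels.
    intervals = f"{prefix}.realign.intervals"
    realigned = f"{prefix}.realigned.bam"
    cmds.append(gatk_cmd("RealignerTargetCreator", ref, merged,
                         ["-known", known_indels, "-o", intervals]))
    cmds.append(gatk_cmd("IndelRealigner", ref, merged,
                         ["-targetIntervals", intervals,
                          "-known", known_indels, "-o", realigned]))

    # Step 3: build the recalibration table, then apply it with PrintReads.
    table = f"{prefix}.recal.table"
    recalibrated = f"{prefix}.recal.bam"
    cmds.append(gatk_cmd("BaseRecalibrator", ref, realigned,
                         ["-knownSites", known_sites, "-o", table]))
    cmds.append(gatk_cmd("PrintReads", ref, realigned,
                         ["-BQSR", table, "-o", recalibrated]))
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so the pipeline stops at the first failing step.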

In step 2 I use the following filters:

  • MappingQualityFilter -mmq 40 (Require mapping quality 40 or higher)
  • DuplicateReadFilter
  • FailsVendorQualityCheckFilter
  • UnmappedRead
  • MappingQualityUnavailableFilter
  • MappingQualityZero
  • BadMateFilter
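As a rough sketch, the filter list above could be turned into command-line arguments like this. It assumes GATK 3.x, where read filters are passed with `-rf` and, as far as I know, named without the trailing `Filter`; the helper `read_filter_args` and the `FILTERS` list are made up for illustration.

```python
from typing import List, Tuple

# (filter name as written in the list above, extra options for that filter)
FILTERS: List[Tuple[str, List[str]]] = [
    ("MappingQualityFilter", ["-mmq", "40"]),  # require mapping quality >= 40
    ("DuplicateReadFilter", []),
    ("FailsVendorQualityCheckFilter", []),
    ("UnmappedRead", []),
    ("MappingQualityUnavailableFilter", []),
    ("MappingQualityZero", []),
    ("BadMateFilter", []),
]


def read_filter_args(filters: List[Tuple[str, List[str]]]) -> List[str]:
    """Build '-rf <name> [options]' arguments for a GATK 3.x command line.

    Assumption: -rf expects the filter name without the 'Filter' suffix.
    """
    args: List[str] = []
    for name, options in filters:
        short = name[:-len("Filter")] if name.endswith("Filter") else name
        args += ["-rf", short]
        args += options
    return args
```

The resulting list can be appended to the argument list of whichever step should do the filtering, which makes it easy to try the filters in step 2, in step 3, or in both.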

What I'm unsure of is if I should do that much filtering in step 2 since the recalibrated base qualities affect the mapping quality (right?). I'm thinking it might be better to skip the filters altogether in step 2 and instead filter in step 3 after the base recalibration is done. That way, the filtering is done on more accurate scores and the end result should be more reliable.

What do you guys think? Any tips are greatly appreciated!

Tags: gatk, picard, pipeline, filtering
7.9 years ago by
frostzwerg30 wrote:


I am using GATK as well, performing Indel realignment and base quality recalibration, but I am not an expert! ;-) As the name "base quality recalibration" already suggests, it recalibrates the quality scores of the bases and NOT of the reads (i.e. the mapping quality). Base quality recalibration tries to correct the quality scores for systematic errors made by the sequencing machine. For example, in my opinion there is no need to compute the "HomopolymerCovariate" for Illumina reads, as Illumina's base calling procedure has no trouble calling repeated bases... Am I right here?!

Indel realignment tries to correct errors made while mapping the reads to the reference. Some Indels (usually the smaller ones) won't be called in every read, and SNPs may be introduced instead because they are "more likely" for that read (usually towards the end of a read). Remember, mapping to the reference is done for every read individually! Indel realignment looks at all the reads covering a position where an Indel has been called (in at least one read). If there are additional reads that could support the Indel but currently carry SNPs instead, they will be realigned to contain the Indel as well, if that is possible and makes "sense" (presumably according to some error function).

In summary, this leads to more consistent Indel calls and reduces the number of positions where you have both SNPs and Indels.
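A toy illustration of why this happens, with a made-up reference and reads (this is not GATK's actual algorithm, just mismatch counting): when a 2-base deletion sits near the end of a read, the ungapped placement costs only two mismatches, so an aligner looking at that read alone may prefer two "SNPs" over opening a gap. A read with the same deletion in its middle makes the Indel obvious.

```python
REF = "GATTACAGGCTTAACCGT"       # made-up reference
DEL_POS, DEL_LEN = 7, 2          # the sample lacks the "GG" at positions 7-8

READ_MIDDLE = "GATTACACTTAA"     # deletion falls in the middle of the read
READ_END = "GATTACACT"           # same deletion, but near the end of the read


def mismatches_ungapped(ref: str, read: str, start: int) -> int:
    """Count mismatches when the read is placed on the reference without gaps."""
    window = ref[start:start + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def mismatches_with_deletion(ref: str, read: str, start: int,
                             del_pos: int, del_len: int) -> int:
    """Count mismatches when the alignment skips the deleted reference bases."""
    spliced = ref[start:del_pos] + ref[del_pos + del_len:]
    return sum(1 for a, b in zip(read, spliced[:len(read)]) if a != b)


for name, read in [("middle", READ_MIDDLE), ("end", READ_END)]:
    print(name,
          mismatches_ungapped(REF, read, 0),
          mismatches_with_deletion(REF, read, 0, DEL_POS, DEL_LEN))
# prints:
# middle 5 0
# end 2 0
```

Looking at both reads together, the deletion explains everything with zero mismatches, which is the kind of evidence the realignment step uses to move the second read onto the Indel.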

Which filters are useful may depend on your specific data, but in general I would recommend using most of the filters. You just have to try...

Generally speaking, I would suggest that Indel realignment and base quality recalibration work independently of each other. I also do the realignment first and then the recalibration, but maybe it won't make any great difference...

Hope that helps a little bit!


Indeed, the BaseRecalibrator adjusts base scores, but wouldn't that also affect the mapping quality of the reads since it's dependent on the base score? This is assuming mapping quality is calculated as described here:

Anyway, I'll try the filters in different ways as you suggested. There's probably more than one way to do this. Thank you so much for your answer!

written 7.9 years ago by DonJoe30

Yes, you are right, the recalibration of the base quality also affects the mapping quality (but I'm not sure to what extent...). Anyway, the mapping quality should not affect the Indel realigner, so the results should be the same regardless of the order in which you perform these corrections...

And I'm pretty sure there is more than one way to do it, that's the challenge! ;-)

written 7.9 years ago by frostzwerg30