Question: GATK; Variant calling RNA-Seq best practices: bam processing steps
user230613 (Europe) wrote, 18 months ago:

Hi folks,

I have been looking for the best approach to call variants in RNA-Seq data. GATK proposes the following workflow:

1) Map the reads using STAR.

2) Remove duplicates

3) Split N Trim

4) Indel realignment

5) Base recalibration

6) HaplotypeCaller

My questions are:

  • Is there any study showing the improvement of the variant calling after applying all the BAM processing steps? Let's say, the accuracy of the calls when you run HaplotypeCaller (or other variant caller) using a "raw" bam file or post-processed bam file.
  • Split N Trim: Is this step only required when using STAR as the mapper (to fix the MAPQ 255 issue)?

Thank you in advance,

Tags: rna-seq, variant_calling, gatk
andrew.j.skelton73 (London) wrote, 18 months ago:

> Is there any study showing the improvement of the variant calling after applying all the BAM processing steps? Let's say, the accuracy of the calls when you run HaplotypeCaller (or other variant caller) using a "raw" bam file or post-processed bam file.

The GATK RNA-Seq best practices were never formally published; as far as I remember, they were only turned into a poster. The steps are completely logical, though:

  • Map using STAR - STAR is arguably the best-performing aligner around, and it certainly was at the time this protocol was conceived.
  • Remove Duplicates - RNA-Seq data will contain a lot of duplicate reads, as it's a quantification experiment after all, so removing duplicates is useful to reduce the dataset.
  • Split N Trim - An RNA-specific step that splits each read into exon segments at the N operations in its CIGAR string, hard clips bases overhanging into introns (reduces the FP rate a lot), and fixes the MAPQ 255 issue.
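
To make the Split N Trim step a bit more concrete, here is a toy Python sketch of two of the things it does to a spliced alignment: cutting the alignment at each N (skipped region) in the CIGAR string, and reassigning STAR's MAPQ of 255 to 60. This is only an illustration and not GATK's actual code. The function names `split_at_n` and `reassign_mapq` are made up for this example, and the sketch ignores the overhang clipping and the hard-clip bookkeeping that the real tool performs.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_n(pos, cigar):
    """Split a spliced alignment into per-exon segments.

    pos is the 1-based leftmost reference position of the alignment.
    Returns a list of (start, cigar) tuples, one per stretch of
    operations between N (skipped region) operations.
    """
    segments = []
    seg_start = pos   # reference start of the current segment
    ref_pos = pos     # running reference coordinate
    ops = []          # CIGAR operations of the current segment
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the current exon segment and jump over the intron.
            segments.append((seg_start, "".join(ops)))
            ops = []
            ref_pos += length
            seg_start = ref_pos
        else:
            ops.append(f"{length}{op}")
            if op in "MD=X":  # operations that consume the reference
                ref_pos += length
    if ops:
        segments.append((seg_start, "".join(ops)))
    return segments


def reassign_mapq(mapq, old=255, new=60):
    """Map STAR's 'uniquely mapped' MAPQ (255) to a value GATK accepts."""
    return new if mapq == old else mapq
```

For example, `split_at_n(100, "50M1000N26M")` yields one segment starting at position 100 and a second one starting after the 1000-base gap.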

Compared to a "raw" alignment, the extra steps certainly improve things. Check out this page, which in some cases shows the consequence of not performing a step (see section 3 for example). If you skipped duplicate marking, you'd most likely get a very large spike in your FP rate. If you skipped Split N Trim, you'd run into a lot of downstream issues with the CIGAR strings, MAPQ 255 values, etc.
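
As a rough idea of what duplicate marking does, here is a heavily simplified Python sketch. It assumes single-end reads described by a hypothetical tuple layout and treats reads with the same chromosome, 5' position and strand as one duplicate set, keeping the read with the highest summed base quality. Real tools such as Picard MarkDuplicates handle read pairs, clipping and optical duplicates, so treat this as an illustration only.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    reads: iterable of (name, chrom, five_prime_pos, strand, base_qual_sum).
    Reads sharing chrom, 5' position and strand form one duplicate set;
    the read with the highest quality sum is kept as the representative.
    """
    groups = defaultdict(list)
    for name, chrom, pos, strand, qual in reads:
        groups[(chrom, pos, strand)].append((qual, name))

    duplicates = set()
    for members in groups.values():
        members.sort(reverse=True)  # best quality first
        duplicates.update(name for _, name in members[1:])
    return duplicates


reads = [
    ("r1", "chr1", 100, "+", 900),
    ("r2", "chr1", 100, "+", 950),  # same start and strand as r1, better quality
    ("r3", "chr1", 100, "-", 800),  # opposite strand, so a separate set
    ("r4", "chr2", 100, "+", 700),
]
flagged = mark_duplicates(reads)
```

Here only `r1` ends up flagged, since `r2` is the better-scoring read in the same duplicate set.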

Indel realignment is kind of the black sheep here: since the introduction of the HaplotypeCaller and the deprecation of the UnifiedGenotyper, it may not improve things drastically. The HaplotypeCaller performs graph-based local assembly, which should fix the issues that the UnifiedGenotyper would miss, so this step is optional.

BQSR (base quality score recalibration) will improve things by correcting systematic artefacts in the base quality scores.
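
The idea behind recalibration can be sketched in a few lines of Python: group bases by the quality the sequencer reported, count how often they mismatch the reference at sites not known to vary, and convert that observed error rate back to a Phred score. This is a simplified illustration, not GATK's model: the real BQSR uses more covariates (read group, machine cycle, sequence context), and the +1/+2 pseudocount below is my own assumption to avoid taking the log of zero.

```python
import math
from collections import defaultdict


def empirical_quality(mismatches, observations):
    """Phred-scaled quality from observed errors, with a +1/+2 pseudocount."""
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


def recalibration_table(bases):
    """Map each reported quality to its empirical quality.

    bases: iterable of (reported_q, is_mismatch) pairs, taken only from
    positions that are not known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for reported_q, is_mismatch in bases:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {q: empirical_quality(m, n) for q, (m, n) in counts.items()}
```

If bases reported as Q30 actually mismatch about 1 time in 100, the table would map 30 to roughly 20, which is the kind of systematic overestimate BQSR is meant to correct.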

Overall, yes, these steps certainly help and there are consequences for not following them. See the link above for extra detail; it's a very well explained protocol.
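
For orientation, the BAM processing steps after mapping can be sketched as a list of command lines. The Python helper below only builds the commands as argument lists and does not run anything. The tool names and flags are GATK 3-era and Picard syntax as I recall them from the RNA-Seq protocol, so please check them against the documentation for your installed version (GATK 4 uses a different command syntax). The function name, jar paths and output file names are placeholders, the optional indel realignment step is left out, and the input BAM is assumed to be coordinate-sorted with read groups already added.

```python
def rnaseq_variant_commands(ref, bam, known_sites, prefix="sample"):
    """Build command lines (as argument lists) for the BAM processing steps.

    ref, bam and known_sites are paths supplied by the caller; every output
    name derived from prefix is a placeholder chosen for this sketch.
    """
    gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]
    dedup = f"{prefix}.dedupped.bam"
    split = f"{prefix}.split.bam"
    table = f"{prefix}.recal.table"
    recal = f"{prefix}.recal.bam"
    vcf = f"{prefix}.vcf"
    return [
        # Mark duplicates with Picard
        ["java", "-jar", "picard.jar", "MarkDuplicates",
         f"I={bam}", f"O={dedup}", "CREATE_INDEX=true",
         "VALIDATION_STRINGENCY=SILENT", f"M={prefix}.dup_metrics.txt"],
        # Split N Trim, reassigning MAPQ 255 to 60
        gatk + ["-T", "SplitNCigarReads", "-R", ref, "-I", dedup, "-o", split,
                "-rf", "ReassignOneMappingQuality", "-RMQF", "255",
                "-RMQT", "60", "-U", "ALLOW_N_CIGAR_READS"],
        # Base recalibration: build the table, then apply it
        gatk + ["-T", "BaseRecalibrator", "-R", ref, "-I", split,
                "-knownSites", known_sites, "-o", table],
        gatk + ["-T", "PrintReads", "-R", ref, "-I", split,
                "-BQSR", table, "-o", recal],
        # Variant calling
        gatk + ["-T", "HaplotypeCaller", "-R", ref, "-I", recal,
                "-dontUseSoftClippedBases", "-stand_call_conf", "20.0",
                "-o", vcf],
    ]


commands = rnaseq_variant_commands("ref.fasta", "rg_added_sorted.bam", "dbsnp.vcf")
```

Each entry in `commands` could be passed to `subprocess.run(cmd, check=True)` in order, since every step reads the output of the previous one.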
