GATK quality recalibration to get recalibrated BAM files before SNP calling
1
1
Entering edit mode
4.7 years ago
alicia.poppy ▴ 50

I'm having problems with quality score recalibration using GATK as most tutorials/examples use an old version of GATK which has different syntax and arguments to the current version.

I have run this to get a recalibration table, which I think I need to get the recalibrated BAM files:

gatk-4.1.2.0/gatk BaseRecalibrator \
-reference human_genome/hg38_masked.fa \
--known-sites human_genome/All_20180418.vcf.gz \
--input S01_BAM_file.bam \
--output recal.table

This works fine, and I could run this for all of my samples.

But when I then do the next step to get the recalibrated BAM files I can't work out what to do.

*The old way of doing it was:

 java –jar GenomeAnalysisTK.jar –T PrintReads \ 
–R human.fasta \
–I realigned.bam \ 
–BQSR recal.table \ 
–o recal.bam*

I wrote:

gatk-4.1.2.0/gatk PrintReads \
-R data_R3/human_genome/hg38_masked.fa \
--input S01_BAM_file.bam \
-**BQSR** recal.table \
 --output quality_score_recalibrated_S01.bam

But this makes an error that says: "A USER ERROR has occurred: B is not a recognized option". And I can't find an alternative argument for -BQSR when I look at the PrintReads help manual.

Does anyone know the current way to run this quality score recalibration to get recalibrated BAM files?

GATK BQSR recalibration SNP • 5.6k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
4
Entering edit mode
4.7 years ago

you should use ApplyBQSR :

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php

gatk BaseRecalibrator \
   -I input.bam \
   -R reference.fasta \
   --known-sites sites_of_variation.vcf \
   --known-sites another/optional/setOfSitesToMask.vcf \
   -O recal_data.table

then

 gatk ApplyBQSR \
   -R reference.fasta \
   -I input.bam \
   --bqsr-recal-file recal_data.table \
   -O output.bam
ADD COMMENT
0
Entering edit mode

Hi! @Nicolas Rosewick.Your comment was helpful. However, when I tried, your command was giving me the following error:

A USER ERROR has occurred: '--bqsr_recal_file' is not a valid command.
I tried the below-mentioned command and it worked.
gatk ApplyBQSR \
   -R reference.fasta \
   -I input.bam \
   --bqsr recal_data.table \
   -O output.bam

So I would point using bqsr in small case rather than the syntax in which mentioned on the GATK website.

ADD REPLY
0
Entering edit mode

Have you found a resolution? I am having the same issue doing it the way Nicholas suggested. Thanks

ADD REPLY
0
Entering edit mode

Could you post your command and gatk version ?

ADD REPLY
0
Entering edit mode

UPDATE:

I'm not sure what I did differently honestly, but I now have output files, but they are very small so I know the script was not successful.

I now have a .bam, .bai and a .sbi file.

Here is my code

gatk BaseRecalibratorSpark \
    --reference ref.genome.fa \
    --input filename.bam \
    --known-sites KnownSites.vcf \
    --output recal_data.table

and then I follow up with:

gatk ApplyBQSRSpark \
        --reference ref.genome.fa \
        -I filename.bam \
        --bqsr-recal-file recal_data.table \
        -O output.bam \

I was getting this error:

A USER ERROR has occurred: Argument output was missing: Argument 'output' is required.

**********************************************************************
Using GATK jar /gpfs01/tools/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
 line 51: --input: command not found
 line 55: --output: command not found
ADD REPLY
0
Entering edit mode

Do you have a reason to use the spark version ? Did you try the “standard” version of BQSR ?

ADD REPLY
0
Entering edit mode

Yes, I was using the standard version but it takes ~2 weeks to run exome files. (Is that customary for you?) I am trying to process ~300 more in the next 8 weeks. I am looking for a quicker alternative and I read abou Spark this morning. Will compare the files in the end.

ADD REPLY
0
Entering edit mode

An exome file should run under a day for sure (in my case it takes 2-3 hours on my local cluster depending on the number of reads). Could you tell me more on the computational resource you have ?

ADD REPLY
0
Entering edit mode

Yes, for sure, it will definitely run in under a day. My files are averaging ~5-6 hours. I expect that because I use a node that is shared. If belongs to another PI; if their lab is using it, my job becomes paused. I am using a computing node with 28 cores, 126 GB of memory.

Consider that I have in total 500 files. I have process so far ~200, and I have another 300 left. Basically, it will take me 20-30 more days than I have left in my lab, and I have a hard deadline.

Additionally, if I could make this process easier for lab mates coming behind me, that would be a plus too!

Any insight on this would be extremely appreciated.

ADD REPLY
0
Entering edit mode

you don’t have access to an other cluster (your institution maybe has a bigger one ? ). One node with 28 cores is clearly not a lot to process 500 exomes.. however BQSR use only one core per sample. So you could in theory process 28 samples in parallel thus dropping the compute time for 300 samples to ~65h. Of course you will use all cores of the node...

ADD REPLY
0
Entering edit mode

I'd like to continue this conversation via email, if possible. Can I email you?

ADD REPLY
0
Entering edit mode

@rohitsatyam102 It’s --bqsr-recal-file not --bqsr_recal_file

ADD REPLY

Login before adding your answer.

Traffic: 1229 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6