FastQC before and after Base Quality Score Recalibration (BQSR)
1
2
Entering edit mode
23 months ago
alicia.poppy ▴ 50

I have performed GATK's Base Quality Score Recalibration (BQSR) on my BAM files.

The BQSR has made the per base sequence quality look very bad on FastQC. Is this normal? Should I use the BAM files that have been through BQSR and therefore have worse FastQC reports?

The FastQC per base sequence quality goes from being in the green zone

to being all in the lower red zone

The next step I will do is SNP calling. This is genomic data.

(The BAM files contain only the data that is paired, matches uniquely, and has the duplicates marked by Picard). The FastQC images are from just one of my samples but all samples have the same pattern of being all green or almost all in the green zone, to being entirely in the red zone after BQSR.

FastQC BQSR GATK SNP • 1.4k views
0
Entering edit mode
0
Entering edit mode

That is extremely weird. How did you run BQSR?

0
Entering edit mode

I did it like this:

###1/2 Quality score recalibration - to get a recalibration table
#loop
for (i in 1:40){
cmd <- paste("gatk-4.1.2.0/gatk BaseRecalibrator",
"--known-sites data_R3/human_genome/All_20180418.vcf.gz",
"--input", RG_BAM_files[i],
"--output", recal_tables[i])
print(cmd)
}

###2/2 Quality score recalibration - to get final BAM files
#loop
for (i in 1:40){
cmd <- paste("gatk-4.1.2.0/gatk ApplyBQSR",
"--input", RG_BAM_files[i],
"--bqsr-recal-file", recal_tables[i],
"--output", Final_BAM[i])
print(cmd)
}


I would then c&p the output of the cmds into my command line. I did this because R uses java12 not java1.8 so it didn't work directly through R.

0
Entering edit mode

That's an... unusual way of running stuff.

0
Entering edit mode

Yes it is annoying, usually I can just do system(cmd) in the loop to have it run on the command line through R. But java was causing problems so I just did it through c&p each print because I couldn't think how else to do it.

0
Entering edit mode

What was your problem with JAVA. BTW, why use R to run a shell-script (BQSR)?

0
Entering edit mode

It is because if I run command line though R Studio using "system(cmd)" it makes an error saying:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method lambda$identity$d67512bf$1(Lorg/broadinstitute/hellbender/utils/read/GATKRead;)Lorg/broadinstitute/hellbender/utils/read/GATKRead; at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef at org.broadinstitute.hellbender.transformers.ReadTransformer.identity(ReadTransformer.java:30) at org.broadinstitute.hellbender.engine.GATKTool.makePreReadFilterTransformer(GATKTool.java:345) at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:374) at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:93) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205) at org.broadinstitute.hellbender.Main.main(Main.java:291)  I googled it and apparently it's because GATK's BQSR was designed to be run on java 1.8. So there's no error when I run directly on command line when:  java version "1.8.0_212" Java(TM) SE Runtime Environment (build 1.8.0_212-b10) Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)  But there is an error when I when on command line through using system() on R. Because R uses:  java 12.0.1 2019-04-16 Java(TM) SE Runtime Environment (build 12.0.1+12) Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)  I just use R just because I am familiar with it and I like how R Studio works and how scripts can be run easily, whereas I am new to using the command line. ADD REPLY 0 Entering edit mode To solve this, you can always use the complete path of the correct JAVA (inside R or outside also), something like /usr/local/bin/java. To know the correct path, write which java on commandline. ADD REPLY 0 Entering edit mode Did you encounter any error? Could you just run BQSR on a single sample using commandline directly (wihout R)? ADD REPLY 0 Entering edit mode It didn't flag up any errors when I ran them on the command line which is why I was so surprised at the FastQC. For example, I would run this on a sample for the first command to make the table: (base) MacBook-2:Rotation3 alicia$ gatk-4.1.2.0/gatk BaseRecalibrator --reference data_R3/human_genome/hg38_masked.fa --known-sites data_R3/human_genome/All_20180418.vcf.gz --input /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG/Sorted_matched_unique_duplicates_marked_RG_S01.bam --output /Users/alicia/Documents/KCL/Rotation3/data_R3/recal_tables/recal_data_S01.table


The end of the output of this command says this (I would post all the output but it exceeds the character limit):

16:29:32.633 INFO  ProgressMeter - Traversal complete. Processed 3946 total reads in 0.3 minutes.
16:29:32.703 INFO  BaseRecalibrator - Calculating quantized quality scores...
16:29:32.729 INFO  BaseRecalibrator - Writing recalibration report...
16:29:33.370 INFO  BaseRecalibrator - ...done!
16:29:33.370 INFO  BaseRecalibrator - Shutting down engine
[13 June 2019 16:29:33 BST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.47 minutes.
Runtime.totalMemory()=771751936
Tool returned:
3946


For the second command I would write:

(base) MacBook-2:Rotation3 alicia\$ gatk-4.1.2.0/gatk ApplyBQSR --reference data_R3/human_genome/hg38_masked.fa --input /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG/Sorted_matched_unique_duplicates_marked_RG_S01.bam --bqsr-recal-file /Users/alicia/Documents/KCL/Rotation3/data_R3/recal_tables/recal_data_S01.table --output /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG_quality_score_recalibrated/Final_BAM_S01.bam


The end of the output that command (to create the new recalibrated quality BAM files) was:

16:30:30.610 INFO  ProgressMeter - Traversal complete. Processed 152088 total reads in 0.1 minutes.
16:30:30.622 INFO  ApplyBQSR - Shutting down engine
[13 June 2019 16:30:30 BST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=381157376

1
Entering edit mode

<4000 reads? My guess is this is the wrong input file

3
Entering edit mode
23 months ago
alicia.poppy ▴ 50

Apparently it is because I only have data for 1 gene and GATK says "A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better" https://software.broadinstitute.org/gatk/documentation/article?id=11081

1
Entering edit mode

1
Entering edit mode

By best knowledge, there is yet a paper missing that clearly shows BSQR notably increases precision of variant calling. You therefore might simply skip it.