Question: FastQC before and after Base Quality Score Recalibration (BQSR)
2
gravatar for alicia.poppy
11 months ago by
alicia.poppy40
alicia.poppy40 wrote:

I have performed GATK's Base Quality Score Recalibration (BQSR) on my BAM files.

The BQSR has made the per base sequence quality look very bad on FastQC. Is this normal? Should I use the BAM files that have been through BQSR and therefore have worse FastQC reports?

The FastQC per base sequence quality goes from being in the green zone https://ibb.co/hyv5DQy

to being all in the lower red zone

https://ibb.co/ckN1fFw

The next step I will do is SNP calling. This is genomic data.

(The BAM files contain only the data that is paired, matches uniquely, and has the duplicates marked by Picard). The FastQC images are from just one of my samples but all samples have the same pattern of being all green or almost all in the green zone, to being entirely in the red zone after BQSR.

fastqc snp bqsr gatk • 752 views
ADD COMMENTlink modified 11 months ago • written 11 months ago by alicia.poppy40

How to add images to a Biostars post

ADD REPLYlink written 11 months ago by ATpoint35k

That is extremely weird. How did you run BQSR?

ADD REPLYlink written 11 months ago by Santosh Anand5.1k

I did it like this:

###1/2 Quality score recalibration - to get a recalibration table
#loop 
for (i in 1:40){
  cmd <- paste("gatk-4.1.2.0/gatk BaseRecalibrator",
               "-reference data_R3/human_genome/hg38_masked.fa",
               "--known-sites data_R3/human_genome/All_20180418.vcf.gz",
               "--input", RG_BAM_files[i],
               "--output", recal_tables[i])
  print(cmd)
}

###2/2 Quality score recalibration - to get final BAM files
#loop
for (i in 1:40){
  cmd <- paste("gatk-4.1.2.0/gatk ApplyBQSR",
               "-reference data_R3/human_genome/hg38_masked.fa",
               "--input", RG_BAM_files[i],
               "--bqsr-recal-file", recal_tables[i],
               "--output", Final_BAM[i])
  print(cmd)
}

I would then c&p the output of the cmds into my command line. I did this because R uses java12 not java1.8 so it didn't work directly through R.

ADD REPLYlink modified 11 months ago • written 11 months ago by alicia.poppy40

That's an... unusual way of running stuff.

ADD REPLYlink written 11 months ago by WouterDeCoster43k

Yes it is annoying, usually I can just do system(cmd) in the loop to have it run on the command line through R. But java was causing problems so I just did it through c&p each print because I couldn't think how else to do it.

ADD REPLYlink written 11 months ago by alicia.poppy40

What was your problem with JAVA. BTW, why use R to run a shell-script (BQSR)?

ADD REPLYlink written 11 months ago by Santosh Anand5.1k

It is because if I run command line though R Studio using "system(cmd)" it makes an error saying:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method lambda$identity$d67512bf$1(Lorg/broadinstitute/hellbender/utils/read/GATKRead;)Lorg/broadinstitute/hellbender/utils/read/GATKRead; at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
    at org.broadinstitute.hellbender.transformers.ReadTransformer.identity(ReadTransformer.java:30)
    at org.broadinstitute.hellbender.engine.GATKTool.makePreReadFilterTransformer(GATKTool.java:345)
    at org.broadinstitute.hellbender.engine.GATKTool.getTransformedReadStream(GATKTool.java:374)
    at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:93)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

I googled it and apparently it's because GATK's BQSR was designed to be run on java 1.8.

So there's no error when I run directly on command line when:

  java version "1.8.0_212"
    Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
    Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)

But there is an error when I when on command line through using system() on R. Because R uses:

 java 12.0.1 2019-04-16
    Java(TM) SE Runtime Environment (build 12.0.1+12)
    Java HotSpot(TM) 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

I just use R just because I am familiar with it and I like how R Studio works and how scripts can be run easily, whereas I am new to using the command line.

ADD REPLYlink modified 11 months ago • written 11 months ago by alicia.poppy40

To solve this, you can always use the complete path of the correct JAVA (inside R or outside also), something like /usr/local/bin/java. To know the correct path, write which java on commandline.

ADD REPLYlink written 11 months ago by Santosh Anand5.1k

Did you encounter any error? Could you just run BQSR on a single sample using commandline directly (wihout R)?

ADD REPLYlink written 11 months ago by Santosh Anand5.1k

It didn't flag up any errors when I ran them on the command line which is why I was so surprised at the FastQC.

For example, I would run this on a sample for the first command to make the table:

(base) MacBook-2:Rotation3 alicia$ gatk-4.1.2.0/gatk BaseRecalibrator --reference data_R3/human_genome/hg38_masked.fa --known-sites data_R3/human_genome/All_20180418.vcf.gz --input /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG/Sorted_matched_unique_duplicates_marked_RG_S01.bam --output /Users/alicia/Documents/KCL/Rotation3/data_R3/recal_tables/recal_data_S01.table

The end of the output of this command says this (I would post all the output but it exceeds the character limit):

16:29:32.633 INFO  ProgressMeter - Traversal complete. Processed 3946 total reads in 0.3 minutes.
16:29:32.703 INFO  BaseRecalibrator - Calculating quantized quality scores...
16:29:32.729 INFO  BaseRecalibrator - Writing recalibration report...
16:29:33.370 INFO  BaseRecalibrator - ...done!
16:29:33.370 INFO  BaseRecalibrator - Shutting down engine
[13 June 2019 16:29:33 BST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.47 minutes.
Runtime.totalMemory()=771751936
Tool returned:
3946

For the second command I would write:

(base) MacBook-2:Rotation3 alicia$ gatk-4.1.2.0/gatk ApplyBQSR --reference data_R3/human_genome/hg38_masked.fa --input /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG/Sorted_matched_unique_duplicates_marked_RG_S01.bam --bqsr-recal-file /Users/alicia/Documents/KCL/Rotation3/data_R3/recal_tables/recal_data_S01.table --output /Users/alicia/Documents/KCL/Rotation3/data_R3/BAM/BAM_matched_unique_duplicates_marked_RG_quality_score_recalibrated/Final_BAM_S01.bam

The end of the output that command (to create the new recalibrated quality BAM files) was:

16:30:30.610 INFO  ProgressMeter - Traversal complete. Processed 152088 total reads in 0.1 minutes.
16:30:30.622 INFO  ApplyBQSR - Shutting down engine
[13 June 2019 16:30:30 BST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=381157376
ADD REPLYlink written 11 months ago by alicia.poppy40
1

<4000 reads? My guess is this is the wrong input file

ADD REPLYlink written 11 months ago by Asaf7.6k
3
gravatar for alicia.poppy
11 months ago by
alicia.poppy40
alicia.poppy40 wrote:

Apparently it is because I only have data for 1 gene and GATK says "A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better" https://software.broadinstitute.org/gatk/documentation/article?id=11081

ADD COMMENTlink written 11 months ago by alicia.poppy40
1

Upvoted. You may accept your own answer to close this thread

ADD REPLYlink modified 11 months ago • written 11 months ago by Santosh Anand5.1k
1

By best knowledge, there is yet a paper missing that clearly shows BSQR notably increases precision of variant calling. You therefore might simply skip it.

ADD REPLYlink written 11 months ago by ATpoint35k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1337 users visited in the last hour