##### Recalibrate variant quality scores = run VQSR

Question

Too many variant calls from GATK pipeline using joint calling

1

Entering edit mode

6.3 years ago

Argossy ▴ 50

I have exome sequencing with 69 human samples and used the GATK pipeline with joint calling. But got too many variant calls after the joint calling (~2 million calls per subject) before filtering. even if after the filtering using VQSR and GQ > 20, I still have ~500 k of variant calls. anything wrong in my pipeline?

below is my pipeline:

sequencing snp • 2.1k views

ADD COMMENT • link updated 6.3 years ago by Pierre Lindenbaum 161k • written 6.3 years ago by Argossy ▴ 50

0

Entering edit mode

id=$id_input
dirin=$dir_input

bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L002_R1_001.fastq.gz $dirin/$rawName_L002_R2_001.fastq.gz > $id_L002_001.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L002_R1_002.fastq.gz $dirin/$rawName_L002_R2_002.fastq.gz > $id_L002_002.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L001_R1_001.fastq.gz $dirin/$rawName_L001_R2_001.fastq.gz > $id_L001_001.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L001_R1_002.fastq.gz $dirin/$rawName_L001_R2_002.fastq.gz > $id_L001_002.sam

# 3. Convert .sam to .bam and sort
line=$id_L002_001
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L002_002
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L001_001
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L001_002
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

# 4. Add read group
line=$id_L002_001
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L002_002
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L001_001
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L001_002
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

# 5. Merge two different lanes
id1=$id_L002_001
id2=$id_L002_002
id3=$id_L001_001
id4=$id_L001_002
#id=0199-1-B
samtools merge $id.merge.bam $id1.sorted.rg.bam $id2.sorted.rg.bam $id3.sorted.rg.bam $id4.sorted.rg.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$id.merge.bam OUTPUT=$id.merge.sorted.bam SORT_ORDER=coordinate
samtools index $id.merge.sorted.bam

# 6. Dedup
#id=0199-1-B
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar MarkDuplicates INPUT=$id.merge.sorted.bam OUTPUT=$id.merge.sorted.dedup.bam  METRICS_FILE=$id.merge.sorted.dedup.metrics.txt PROGRAM_RECORD_ID= MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar BuildBamIndex INPUT=$id.merge.sorted.dedup.bam

## remove the temp files
rm -rf $id.merge.sorted.bam
rm -rf $id.merge.sorted.bam.bai

#7. Realign
#id=0199-1-B
gatk -T RealignerTargetCreator -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.bam -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.target_intervals.list
gatk -T IndelRealigner -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.bam -targetIntervals $id.merge.sorted.dedup.target_intervals.list -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.realigned.bam

# 8. Recalibrate
#id=0199-1-B
gatk -T BaseRecalibrator -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.realigned.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.realigned.recal_data.table

gatk -T PrintReads -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.realigned.bam -BQSR $id.merge.sorted.dedup.realigned.recal_data.table -o $id.merge.sorted.dedup.realigned.recal.bam
samtools index $id.merge.sorted.dedup.realigned.recal.bam

## 9. GATK HaplotypeCaller
#id=0199-1-B
gatk -R ucsc.hg19.fasta -T HaplotypeCaller -I $id.merge.sorted.dedup.realigned.recal.bam --dbsnp dbsnp_138.hg19.vcf --emitRefConfidence GVCF -o raw_variant.$id.g.vcf

ADD REPLY • link updated 6.3 years ago by GenoMax 141k • written 6.3 years ago by Argossy ▴ 50

score 0 · Answer 1 · 2018-01-11

0

Entering edit mode

6.3 years ago

Pierre Lindenbaum 161k

 --emitRefConfidence GVCF

You're creating a GVCF, that is not a VCF (https://gatkforums.broadinstitute.org/gatk/discussion/4017/), you need to call GenotypeGVFs

ADD COMMENT • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I actually perform joint calling:

gatk -T GenotypeGVCFs -R ucsc.hg19.fasta -V raw_variant.0107-2-1.g.vcf -V raw_variant.0107-2-A.g.vcf -V raw_variant.0107-2-B.g.vcf ...

and then

##### Recalibrate variant quality scores = run VQSR

### 1. Prepare recalibration parameters for SNPs

### 2. Build the SNP recalibration model

gatk -T VariantRecalibrator -R ucsc.hg19.fasta -input variant_raw_joint.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

### 3. Apply the desired level of recalibration to the SNPs in the call set

gatk -T ApplyRecalibration -R ucsc.hg19.fasta -input variant_raw_joint.vcf -mode SNP --ts_filter_level 99.5 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf

### 4. Prepare recalibration parameters for Indels

### 5. Build the Indel recalibration model

gatk -T VariantRecalibrator -R ucsc.hg19.fasta -input recalibrated_snps_raw_indels.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

#### 6. Apply the desired level of recalibration to the Indels in the call set

gatk -T ApplyRecalibration -R ucsc.hg19.fasta -input recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o recalibrated_variants.vcf

ADD REPLY • link 6.3 years ago by Argossy ▴ 50