So many variants detected.
Entering edit mode
3 months ago
Mehmet ▴ 720

Dear All,

I have done variant calling in Germline data that has single sample of each individual and two genes.

I did following steps, but after checking results I found too many variants.

After Haplotypecaller (the step 6) I found 140900 known variants, and the steps 7 and 8 didn't filter any variants.

What should be causing this result?

1. index the reference fasta (hg19.fa)

    bwa index hg19.fa

2. mapping to the reference genome (hg19.fa)
   I randomly gave read group information. 

    bwa mem -t 16 -R '@RG\tID:21002\tSM:Sample_21002' hg19.fa ../../Trim_final/21002.1.FASTP.fastq.gz  ../../Trim_final/21002.2.FASTP.fastq.gz > 21002.bam

3. picard sort bam file by queryname

     java -jar ~/applications/picard-tools-2.23.0/picard.jar SortSam I=21002.bam  O=21002.bam.PICARD.output.sorted.queryname.bam SORT_ORDER=queryname

4. mark duplicates by MarkDuplicatesSpark

    gatk --java-options "-Xmx5g" MarkDuplicatesSpark -I 21002.bam.PICARD.output.sorted.queryname.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam --tmp-dir ./Test/Tmps/

5. Base (Quality Score) Recalibration

    Tools involved: BaseRecalibrator, Apply Recalibration

    ##first run BaseRecalibrator and later run ApplyBQSR using a table produced by BaseRecalibrator and a bam file produced at the step4.

    ###make an index of dbsnp_138.hg19.vcf

    gatk IndexFeatureFile -F  dbsnp_138.hg19.vcf ## this will generate dbsnp_138.hg19.vcf.idx 

    ##### add read group and platform to bam files;

    java -jar ~/applications/picard-tools-2.23.0/picard.jar AddOrReplaceReadGroups I=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam RGID=21002  RGLB=21002lib1 RGPL=ILLUMINA RGPU=21002 RGSM=20 


    gatk BaseRecalibrator -R hg19.fa -I  --known-sites dbsnp_138.hg19.vcf  --known-sites ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz -O


    gatk ApplyBQSR -R hg19.fa -I --bqsr-recal-file -O

6. HaplotypeCaller (single sample mode)

gatk HaplotypeCaller --bam-output --native-pair-hmm-threads 64 -R hg19.fa -I -O -D dbsnp_138.hg19.vcf

7. CNNScoreVariants

      gatk CNNScoreVariants -R hg19.fa -V -O

8. FilterVariantTranches

       gatk --java-options "-Xmx7g" FilterVariantTranches -V --resource ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz --resource ../dbsnp_138.hg19.vcf --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O
gatk4 variants • 228 views
Entering edit mode

OK. I have figured it out. If we specify intervals of gene or genes of interest at Haplotypecaller or at VariantFiltration run like -L chr4:100,232,323-101,343,433 -L chr15:104,543,211-104,403,129 (randomly typed intervals here for example), number of final variants are greatly reduced because we keep variants that we want to search at gene or genes of interest.

this post is also a solution to set intervals at Haplotypecaller or VariantFiltration steps that were asked and probably will be asked for other people.


Login before adding your answer.

Traffic: 1752 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6