Entering edit mode
                    4.2 years ago
        Mehmet
        
    
        ▴
    
    820
    Dear All,
I have done variant calling in Germline data that has single sample of each individual and two genes.
I did following steps, but after checking results I found too many variants.
After Haplotypecaller (the step 6) I found 140900 known variants, and the steps 7 and 8 didn't filter any variants.
What should be causing this result?
1. index the reference fasta (hg19.fa)
    bwa index hg19.fa
2. mapping to the reference genome (hg19.fa)
   I randomly gave read group information. 
    bwa mem -t 16 -R '@RG\tID:21002\tSM:Sample_21002' hg19.fa ../../Trim_final/21002.1.FASTP.fastq.gz  ../../Trim_final/21002.2.FASTP.fastq.gz > 21002.bam
3. picard sort bam file by queryname
     java -jar ~/applications/picard-tools-2.23.0/picard.jar SortSam I=21002.bam  O=21002.bam.PICARD.output.sorted.queryname.bam SORT_ORDER=queryname
4. mark duplicates by MarkDuplicatesSpark
    gatk --java-options "-Xmx5g" MarkDuplicatesSpark -I 21002.bam.PICARD.output.sorted.queryname.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam --tmp-dir ./Test/Tmps/
5. Base (Quality Score) Recalibration
    Tools involved: BaseRecalibrator, Apply Recalibration
    ##first run BaseRecalibrator and later run ApplyBQSR using a table produced by BaseRecalibrator and a bam file produced at the step4.
    ###make an index of dbsnp_138.hg19.vcf
    gatk IndexFeatureFile -F  dbsnp_138.hg19.vcf ## this will generate dbsnp_138.hg19.vcf.idx 
    ##### add read group and platform to bam files;
    java -jar ~/applications/picard-tools-2.23.0/picard.jar AddOrReplaceReadGroups I=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam  O=21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam RGID=21002  RGLB=21002lib1 RGPL=ILLUMINA RGPU=21002 RGSM=20 
#####BaseRecalibrator
    gatk BaseRecalibrator -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam  --known-sites dbsnp_138.hg19.vcf  --known-sites ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table
  #####ApplyBQSR
    gatk ApplyBQSR -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam --bqsr-recal-file 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.recal_data.table -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam
6. HaplotypeCaller (single sample mode)
gatk HaplotypeCaller --bam-output 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.HaplotypeCaller.OUTPUT.bam --native-pair-hmm-threads 64 -R hg19.fa -I 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -D dbsnp_138.hg19.vcf
7. CNNScoreVariants
      gatk CNNScoreVariants -R hg19.fa -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf
8. FilterVariantTranches
       gatk --java-options "-Xmx7g" FilterVariantTranches -V 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf --resource ../Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.EDITED.gz --resource ../dbsnp_138.hg19.vcf --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O 21002.bam.PICARD.output.sorted.queryname.bam.MarkDuplicatesSpark.OUTPUT.bam.read.groud.bam.ApplyBQSR.OUTPUT.bam.HaplotypeCaller.output.NO.GVCF.option.vcf.gz.CNNScoreVariants.OUTPUT.vcf.FilterVariantTranches.OUTPUT.vcf
OK. I have figured it out. If we specify intervals of gene or genes of interest at
Haplotypecalleror atVariantFiltrationrun like-L chr4:100,232,323-101,343,433 -L chr15:104,543,211-104,403,129(randomly typed intervals here for example), number of final variants are greatly reduced because we keep variants that we want to search at gene or genes of interest.this post is also a solution to set intervals at
HaplotypecallerorVariantFiltrationsteps that were asked and probably will be asked for other people.