Question: GATK - gene panel VCF - too many variants - how to filter out
2
gravatar for stanedav
2.3 years ago by
stanedav40
stanedav40 wrote:

Hello guys,

I have question about filtering NGS data with GATK workflow. I have data from NGS Illumina (Sureselect kit), it is targeted gene panel (about 112 genes), usually we are using two softwares - Surecall and Nextgene with default settings. Output of this analyses is VCF file with about 500 variants.

Now I am trying to implement GATK workflow from fastq to VCF. (this workflow: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS)

Basically it is - bam created by bwa-mem, then I remove duplicates and do base recalibration. Then I use Haplotype Caller for creating gvcf and finally VCF.

Last step that I do, is hard filtering raw snps VCF with filters from GATK -

--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Problem is, that after this step my vcf has about 40 000 variants passing the filter - do you guys have any idea, where I need to change parameters to get about same amount of variants as running by softwares like SureCall?

Thanks for any tips :)

ngs gatk vcf • 1.7k views
ADD COMMENTlink modified 4 months ago by Biostar ♦♦ 20 • written 2.3 years ago by stanedav40

Hello,

40 000 variants in 112 genes. That's much to much. Can you post the commands you use for the alignment and variant calling? And also the first few entrys of your vcf file are interessting.

fin swimmer

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by finswimmer12k
0
gravatar for stanedav
2.3 years ago by
stanedav40
stanedav40 wrote:

Here is my complete workflow. fastq data are from Sureselect kit:

./bwa mem -M -R \
    @RG\\tID:ID_1234\\tSM:SM_1234\\tLB:LB_1234\\tPL:Illumina\\tPI:150 '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    '/1234/1234_1.fastq' '/1234/1234_2.fastq' > '/1234/1234.sam'  

samtools sort -@ 8 \
    -T /path_to_temp/ \
    -o /1234/1234.bam

java -jar build/libs/picard.jar MarkDuplicates \
    INPUT='/1234/1234.bam' \
    OUTPUT='/1234/1234_dedup.bam' \
    METRICS_FILE='/1234/1234_metrics'

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/dbsnp_138.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/1000G_phase1.indels.hg19.vcf' \
    -o '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -o '/1234/1234_remdup_recal.bam' \
    -BQSR '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup_recal.bam' \
    -o '/1234/1234_remdup_recal.g.vcf' \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000

java -jar GenotypeAnalysisTK.jar -T GenotypeGVCFs  \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -o '/1234/1234.vcf' \
    --variant '/1234/1234_remdup_recal.g.vcf' \

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -V '/1234/1234.vcf' \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "GATKdef" \
    -o '/1234/1234_filtered.vcf'
ADD COMMENTlink written 2.3 years ago by stanedav40
3

You should use bed flle from SureSelect kit.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \ -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \ -I '/1234/1234_remdup_recal.bam' \ -o '/1234/1234_remdup_recal.g.vcf' \ --emitRefConfidence GVCF \ --variant_index_type LINEAR \ --variant_index_parameter 128000 -L sureselect.bed

Another question, If you have only one sample then don't need to use --emitRefConfidence GVCF as well and can ignore GenotypeGVCFs step as well.

ADD REPLYlink written 2.3 years ago by always_learning980

Is this just a typo here? Your output file from MarkDuplicates is 1234_dedup.bam but in the next steps you use 1234_remdup.bam as the input file name.

fin swimmer

ADD REPLYlink written 2.3 years ago by finswimmer12k

Yes, just typo, sorry...

ADD REPLYlink written 2.3 years ago by stanedav40
0
gravatar for Robert Sicko
2.3 years ago by
Robert Sicko570
United States
Robert Sicko570 wrote:

targeted gene panel (about 112 genes)

Do you specify a .bed file in SureCall and Nextgene? I'd bet your abundance of variants in your pipeline come from regions outside your target region. You can filter your filtered vcf for your target region using the bed file for your Sureselect kit.

ADD COMMENTlink written 2.3 years ago by Robert Sicko570
0
gravatar for stanedav
2.2 years ago by
stanedav40
stanedav40 wrote:

Bed file solved the problem! You are best guys!

ADD COMMENTlink written 2.2 years ago by stanedav40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1666 users visited in the last hour