Question: GATK - gene panel VCF - too many variants - how to filter out
2
gravatar for stanedav
22 months ago by
stanedav30
stanedav30 wrote:

Hello guys,

I have question about filtering NGS data with GATK workflow. I have data from NGS Illumina (Sureselect kit), it is targeted gene panel (about 112 genes), usually we are using two softwares - Surecall and Nextgene with default settings. Output of this analyses is VCF file with about 500 variants.

Now I am trying to implement GATK workflow from fastq to VCF. (this workflow: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS)

Basically it is - bam created by bwa-mem, then I remove duplicates and do base recalibration. Then I use Haplotype Caller for creating gvcf and finally VCF.

Last step that I do, is hard filtering raw snps VCF with filters from GATK -

--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Problem is, that after this step my vcf has about 40 000 variants passing the filter - do you guys have any idea, where I need to change parameters to get about same amount of variants as running by softwares like SureCall?

Thanks for any tips :)

ngs gatk vcf • 1.4k views
ADD COMMENTlink modified 22 months ago • written 22 months ago by stanedav30

Hello,

40 000 variants in 112 genes. That's much to much. Can you post the commands you use for the alignment and variant calling? And also the first few entrys of your vcf file are interessting.

fin swimmer

ADD REPLYlink modified 22 months ago • written 22 months ago by finswimmer11k
0
gravatar for stanedav
22 months ago by
stanedav30
stanedav30 wrote:

Here is my complete workflow. fastq data are from Sureselect kit:

./bwa mem -M -R \
    @RG\\tID:ID_1234\\tSM:SM_1234\\tLB:LB_1234\\tPL:Illumina\\tPI:150 '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    '/1234/1234_1.fastq' '/1234/1234_2.fastq' > '/1234/1234.sam'  

samtools sort -@ 8 \
    -T /path_to_temp/ \
    -o /1234/1234.bam

java -jar build/libs/picard.jar MarkDuplicates \
    INPUT='/1234/1234.bam' \
    OUTPUT='/1234/1234_dedup.bam' \
    METRICS_FILE='/1234/1234_metrics'

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/dbsnp_138.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/1000G_phase1.indels.hg19.vcf' \
    -o '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -o '/1234/1234_remdup_recal.bam' \
    -BQSR '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup_recal.bam' \
    -o '/1234/1234_remdup_recal.g.vcf' \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000

java -jar GenotypeAnalysisTK.jar -T GenotypeGVCFs  \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -o '/1234/1234.vcf' \
    --variant '/1234/1234_remdup_recal.g.vcf' \

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -V '/1234/1234.vcf' \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "GATKdef" \
    -o '/1234/1234_filtered.vcf'
ADD COMMENTlink written 22 months ago by stanedav30
3

You should use bed flle from SureSelect kit.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \ -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \ -I '/1234/1234_remdup_recal.bam' \ -o '/1234/1234_remdup_recal.g.vcf' \ --emitRefConfidence GVCF \ --variant_index_type LINEAR \ --variant_index_parameter 128000 -L sureselect.bed

Another question, If you have only one sample then don't need to use --emitRefConfidence GVCF as well and can ignore GenotypeGVCFs step as well.

ADD REPLYlink written 22 months ago by always_learning960

Is this just a typo here? Your output file from MarkDuplicates is 1234_dedup.bam but in the next steps you use 1234_remdup.bam as the input file name.

fin swimmer

ADD REPLYlink written 22 months ago by finswimmer11k

Yes, just typo, sorry...

ADD REPLYlink written 22 months ago by stanedav30
0
gravatar for Robert Sicko
22 months ago by
Robert Sicko570
United States
Robert Sicko570 wrote:

targeted gene panel (about 112 genes)

Do you specify a .bed file in SureCall and Nextgene? I'd bet your abundance of variants in your pipeline come from regions outside your target region. You can filter your filtered vcf for your target region using the bed file for your Sureselect kit.

ADD COMMENTlink written 22 months ago by Robert Sicko570
0
gravatar for stanedav
22 months ago by
stanedav30
stanedav30 wrote:

Bed file solved the problem! You are best guys!

ADD COMMENTlink written 22 months ago by stanedav30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1866 users visited in the last hour