Low mean of read depth - filtering SNP in VCF file
1
0
Entering edit mode
4.5 years ago
Trevorsky • 0

I have a question, I have two WGS datasets for cattle. I am looking for a data filtering protocol to leave only reliable SNP. I found several publications where the SNP in the immediate neighbourhood was filtered (e. g. in three consecutive positions) and the SNPs that were outside the 3 standard deviations from the mean or median were rejected. I have a very low DP average in my data - DP is about 12 and SD is about 6 - which excludes this method.

I have been looking for information in publications for two weeks now, but it is quite residual and does not lead to any meaningful protocol.

I especially looking for publications or protocols for BASH or vcf/bcftools.

snp vcf filtering • 2.2k views
ADD COMMENT
1
Entering edit mode
4.5 years ago
joneill4x ▴ 160

I would recommend processing your data with GATK and follow their best practices.

Since your species is cattle, you will 'hard-filter' the variant calls. https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set

Once GATK has output the final, hard-filtered VCF file, you can eliminate any genotype call with a Genotype Quality Score less than a specified threshold. We use a threshold of 20 (99% accuracy). The Genotype Quality Score is a value on the Phred scale.

Once we started to use the GATK pipeline, the quality of our variant calls increase dramatically. Here are some example commands:

// prepare reference genome

java -jar picard.jar CreateSequenceDictionary R= F3bR3b.fasta O= F3bR3b.dict

bwa index F3bR3b.fasta

samtools faidx F3bR3b.fasta

// align reads form one sample to the ref genome

bwa mem -M -R '@RG\tID:HA93071\tSM:POOL2' -t 16 HA93071_gen.fasta POOL2_S2_L001_R1_001.fastq.gz POOL2_S2_L001_R2_001.fastq.gz | samtools sort -@ 8 -T temp -o HA93071_sorted.bam

samtools index HA93071_sorted.bam

// skipped the MarkDuplicates step

// skipped the Base Quality Score Recalibration step

// HaplotypeCaller step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R F3bR3b.fasta -I F3BR3b_sorted.bam --emitRefConfidence GVCF -o F3bR3b.g.vcf -variant_index_type LINEAR -variant_index_parameter 128000 -nct 16

// GenotypeGVCFs step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R F3bR3b.fasta --variant F3bR3b.g.vcf -o F3bR3b.vcf -nt 1

// Variant Filtration step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T VariantFiltration -R F3bR3b.fasta -V F3bR3b.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "deleteThis" -o badsMarked.vcf

// Select variants step, only keep SNPs that pass the filtering criteria

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T SelectVariants -R F3bR3b.fasta --variant badsMarked.vcf -o F3bR3b_filtSNPs.vcf -select 'vc.isNotFiltered()' --selectTypeToInclude SNP

// convert final VCF to more readable format, mask any genotype calls with a genotype quality score < 20

perl /mnt/nas/JoesBackup/gatkVCFtoHapMap.pl F3bR3b_filtSNPs.vcf 20 3 F3bR3b.hmp.txt F3bR3b.hmc.txt

ADD COMMENT
0
Entering edit mode

The HaplotypeCaller step does local re-alignment, which increases the quality of the variant calls.
You can use vcftools to filter a VCF file on minimum genotype quality score.

Example:

vcftools --vcf firstTomatoGBS_filt_v2.5.vcf --minGQ 20 --max-missing 0.05 --maf 0.01 --recode --out qualFilt

ADD REPLY
0
Entering edit mode

joneill4x - Thanks for your answer!

But I only have the final vcf files that I want to compare in terms of SNP detection efficiency. I don't have any sequencing files or references. Therefore, I want to rely on the data that I have in the files, and on their basis to conduct a reliable analysis. Maybe you could tell us what values, apart from DP and QUAL, are worth relying on?

ADD REPLY
0
Entering edit mode

GT:AD:DP:GQ:PGT:PID:PL

The GQ field is the genotype quality score. It is a value on the Phred scale. A GQ score of 20 means that there is a 1% chance that GT value is erroneous. A GQ score of 30 means that there is a 0.1% chance that the GT value is erroneous. See https://drive5.com/usearch/manual/quality_score.html

One can filter their VCF based on GQ , so that only high-probability GT values remain.

Example:

vcftools --vcf firstTomatoGBS_filt_v2.5.vcf --minGQ 20 --max-missing 0.05 --maf 0.01 --recode --out qualFilt

We often filter with a min DP value of 5 and a min GQ value of 20.

ADD REPLY

Login before adding your answer.

Traffic: 2199 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6