Question

Find common SNPs between high and low depth sequence data

1

4.4 years ago

evelyn ▴ 230

Hello All,

I have a VCF file for multiple samples, produced by joint variant calling with bcftools on a high-depth dataset, and a similar multi-sample VCF file produced the same way from a low-depth dataset. I want to know which SNP positions are shared between these two VCF files (high vs. low depth).

And where a position is present in both VCF files, do the SNP calls at that position match? For example, if POS 5786 is present in both VCF files, are the REF and ALT calls the same in both cases for that position?

I am not aware of the options available for finding this information. I would appreciate any help! Thank you!

snp bcftools vcf • 681 views

ADD COMMENT • link updated 4.4 years ago by zx8754 11k • written 4.4 years ago by evelyn ▴ 230

score 2 · Answer 1 · 2019-11-27

2

4.4 years ago

JC 13k

Hi there,

A VCF only records differences from the reference genome, and only the variation actually observed in its samples, so you can have cases like:

VCF 1:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3
chr1  123456  .  G  A  50  PASS  .  GT  0|1  0|0   1|1
chr1  123459  .  T  C  50  PASS  .  GT  0|0  0|0   0|1

But in the other VCF2:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B1 B2 B3
chr1  123456  .  G  T  50  PASS  .  GT  0|1  0|0   1|1
chr1  123490  .  G  C  50  PASS  .  GT  0|0  0|0   0|1

Therefore, you have two different calls at chr1:123456 (ALT is A in VCF 1 but T in VCF 2), and chr1:123459 does not exist in VCF 2 (maybe the samples are homozygous reference there, or the coverage was too low to make the call).
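
To make those cases concrete, here is a minimal Python sketch (standard library only) that sorts the positions of the two example VCFs into "shared with the same REF/ALT", "shared with different REF/ALT", and "present in only one file". The function names `read_sites` and `compare_sites` are made up for this illustration. It assumes uncompressed VCF text with one record per position and compares ALT as a plain string, so multiallelic or split records would need extra handling. It is not a replacement for a dedicated tool.

```python
def read_sites(vcf_text):
    """Map (CHROM, POS) -> (REF, ALT) for every data line of a VCF given as text."""
    sites = {}
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split()[:5]
        sites[(chrom, int(pos))] = (ref, alt)
    return sites


def compare_sites(sites_a, sites_b):
    """Classify positions as shared (same or different alleles) or unique to one file."""
    shared = sites_a.keys() & sites_b.keys()
    return {
        "same_alleles": sorted(k for k in shared if sites_a[k] == sites_b[k]),
        "different_alleles": sorted(k for k in shared if sites_a[k] != sites_b[k]),
        "only_in_a": sorted(sites_a.keys() - sites_b.keys()),
        "only_in_b": sorted(sites_b.keys() - sites_a.keys()),
    }


# The two example files from this answer (whitespace-separated here for brevity).
VCF1 = """#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3
chr1 123456 . G A 50 PASS . GT 0|1 0|0 1|1
chr1 123459 . T C 50 PASS . GT 0|0 0|0 0|1
"""
VCF2 = """#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B1 B2 B3
chr1 123456 . G T 50 PASS . GT 0|1 0|0 1|1
chr1 123490 . G C 50 PASS . GT 0|0 0|0 0|1
"""

result = compare_sites(read_sites(VCF1), read_sites(VCF2))
for category, positions in result.items():
    print(category, positions)
```

On the example data, chr1:123456 lands in `different_alleles`, chr1:123459 in `only_in_a`, and chr1:123490 in `only_in_b`. For real multi-sample files the same dictionary approach works, but memory use grows with the number of sites.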

You need to be aware of all of this. Also, GATK GVCF is better for combining data from different individuals or populations.

ADD COMMENT • link 4.4 years ago by JC 13k

0

Hello,

Thank you for your guidance. I have used this:

gatk Concordance \
   -R ref.fa \
   -eval low_coverage.vcf \
   --truth high_coverage.vcf \
   --summary summary.tsv

The output summary.tsv shows:

type    TP  FP  FN  RECALL  PRECISION
SNP 95612107    8193277 39372182    0.708   0.921
INDEL   0   0   0   0.0 0.0
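
As a sanity check on the table above, the RECALL and PRECISION columns can be reproduced from the TP, FP, and FN counts using the usual definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP). This short Python sketch only redoes that arithmetic for the SNP row. It assumes nothing about how the tool decides what counts as a true positive.

```python
# Counts copied from the SNP row of summary.tsv
tp, fp, fn = 95612107, 8193277, 39372182

recall = tp / (tp + fn)      # fraction of truth (high-coverage) SNPs recovered
precision = tp / (tp + fp)   # fraction of eval (low-coverage) SNPs that are TP

print(f"recall={recall:.3f} precision={precision:.3f}")
```

Both values round to the figures reported in the summary (0.708 and 0.921).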

I am confused about the output. If I understand correctly, TP is the number of overlapping (common) SNPs between the two files. How can I also get statistics on how often the SNP calls agree among those common SNPs, e.g., A at the same position in both files?

Thank you!

ADD REPLY • link 4.4 years ago by evelyn ▴ 230