Different relatedness estimates by PLINK and VCFTOOLS despite same method
0
0
Entering edit mode
11 weeks ago
8armed • 0

According to the vcftools manual, specifying the "--relatedness2" flag allows calculating relatedness statistics using the method by Manichaikul et al., BIOINFORMATICS 2010 (doi:10.1093/bioinformatics/btq559). That is, based on KING. According to the PLINK manual, PLINK uses the same method to calculate relatedness when specifying the flag "--make-king-table". So, although both PLINK and vcftools seem to be using the same algorithm, the name of the outputted relatedness statistics differs. In the vcftools output, it's called "RELATEDNESS_PHI", whereas in PLINK it's called "KINSHIP".

More worrying to me is, however, that the relatedness estimates markedly differ between PLINK and vcftools. How is this possible given that they apparently rely on the same method? Below a graph that shows relatedness estimates calculated with both programs based on the same vcf input file.

Comparison of PLINK vs vcftools

I also checked whether results change when I run relatedness analyses on a raw vcf file outputted GATK's GenotypeGVCFs which I just mildly filtered with vcftools. But estimates still differ between PLINK and vcftools.

Notably, vcftools throws me some warnings when running the relatedness analysis (see below). Yet, the actual relatedness estimates I get from vcftools make biologically more sense than the estimates I get from PLINK. The ones from PLINK are markedly too high.

*## Warning for vcftools:
Parameters as interpreted:
vcftools
--vcf INPUT.vcf
--out OUTPUT
--relatedness2
Warning: Expected at least 2 parts in FORMAT entry: ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
Warning: Expected at least 2 parts in FORMAT entry: ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
Warning: Expected at least 2 parts in FORMAT entry: ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
After filtering, kept 418 out of 418 Individuals
Outputting Individual Relatedness
After filtering, kept 13357 out of a possible 13357 Sites*
kinship vcftools PLINK relatedness • 725 views
ADD COMMENT
1
Entering edit mode

Please post an actual example (could contain just two samples and a handful of SNPs) allowing someone else to reproduce the discrepancy you're describing.

ADD REPLY
0
Entering edit mode

Thanks for following up on my problem! Here an exemplary vcf file containing SNP data (1567 SNPs) from 10 individuals. How can I best attach/post this vcf file here for someone else to cross-check?

Here what I do:

Using vcftools:

vcftools --vcf PATH/Example.vcf \ --relatedness2 \ --out PATH/Example

-> Here I get the Warnings as mentioned above

Using PLINK:

1) Create input for Plink

plink --vcf PATH/Example.vcf \ --allow-extra-chr \ --make-bed \ --out PATH/Example

2) Plink does not tolerate chromosome names that are not human chromosomes. Thus, I'm replacing the first column by 1

awk '{$1="0";print $0}' PATH/Example.bim > PATH/Example.bim.tmp mv PATH/Example.bim.tmp PATH/Example.bim

3) Calculate kinship

plink2 --bfile PATH/Example \ --allow-extra-chr \ --make-king-table \ --out PATH/Example

Below a screenshot of part of the output. I colored some of the same pairwise comparisons. As you can see, the relatedness estimates differ.

Outputs

ADD REPLY
0
Entering edit mode
  • I was imagining that you could boil this down to an example small enough ("just two samples and a handful of SNPs") to be posted directly here; you have shown no reason why this isn't possible. But if you don't know how to do that, my email address can be found on the plink2 website.

  • Have you tried the KING software (https://www.kingrelatedness.com/Download.shtml )? I just performed a quick test on a bit of 1000 Genomes data; plink2 and KING agreed, vcftools (0.1.16) --relatedness2 disagreed. Reading the vcftools documentation closely, it seems that they only claim a method “based on” the KING paper, they seem to have made a nonstandard choice along the way without documenting it.

ADD REPLY
0
Entering edit mode

Thanks for this! – This is indeed quite worrying, because the estimates do differ quite substantially and many relations that would be deemed clear first degree relations in PLINK are, for example, deemed only 2nd degree relations in vcftools. It's also worrying because both implementations (PLINK and vcftools) are used quite frequently in the scientific literature.

If someone has another/better answer to this conundrum, please let me know!

ADD REPLY

Login before adding your answer.

Traffic: 1365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6