If something reads wrong, assume the error is on my part.
I am working with a codebase that is a somewhat niche wrapper around vcftools.
It contains the following command:
vcftools --vcf <vcf_subset_with_headers_generated_by_tabix> \
--geno-r2-positions "<positions_file>" \
--ld-window 500 \
--out <output_name>
The vcftools manual says this about the option:
--ld-window <integer>
This optional parameter defines the maximum number of SNPs between the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.
Out of curiosity, I opened the output file produced by the above command, and its first line is:
Chr01 78285061 Chr01 78240548 305 0.000272532
So locus one is 78285061 and locus two is 78240548.
If I then run:
tabix <vcf_subset_with_headers_generated_by_tabix> Chr01:78240548-78285061 | wc -l
the output is 516.
What explains the discrepancy between 500 and 516 here?
I suspect tabix <vcf_subset_with_headers_generated_by_tabix> Chr01:78240548-78285061 | wc -l
might not be the right way to count "number of SNPs between the SNPs being tested".
- Is line index not the right way to count the number of SNPs?
- Is the --ld-window flag a bit lax with the way it applies the limit?
- Is a different data field from vcftools used to calculate the number of SNPs between two positions?
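For reference, this is roughly how I would make the count stricter, so that header lines and the two endpoint records themselves are not included. It is only a sketch: count_between is a hypothetical helper name of my own, not part of tabix or vcftools, and it assumes tab-separated VCF records on stdin with POS in column 2.

```shell
# Hypothetical helper: count VCF records strictly between two positions.
# Reads VCF-style lines on stdin (for example, piped from tabix).
count_between() {
  awk -F'\t' -v lo="$1" -v hi="$2" '
    /^#/ { next }                          # skip header lines, if any
    ($2 + 0) > (lo + 0) && ($2 + 0) < (hi + 0) { n++ }  # endpoints excluded
    END { print n + 0 }                    # print 0 when nothing matched
  '
}
```

I would pipe the same tabix query as above into count_between 78240548 78285061 instead of wc -l. If both endpoint records are in the 516 lines, this would give 514, which is still above 500, so counting the endpoints does not seem to explain the gap on its own.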
I am fairly sure I am missing something here but don't quite know what. Any help is appreciated.