Hi there. I'm working with VCF files and I've noticed something odd. Using the following commands:
bcftools view -S sample_A_list.txt input.vcf > sample_A.vcf
bcftools view -S sample_B_list.txt input.vcf > sample_B.vcf
I created two VCF files from input.vcf, splitting it into two sample groups (A and B). sample_A.vcf has 27 samples and sample_B.vcf has 116 samples. If I then run
egrep -v "^#" sample_A.vcf | wc -l
egrep -v "^#" sample_B.vcf | wc -l
to count the SNPs in each VCF, I get the same result: 5997 SNPs for both files. I then pruned the SNPs in linkage disequilibrium using the plink pipeline (R 0.4) and wrote two new VCFs via the recode function. Running the same count after pruning gives very different values: sample_B.vcf (116 samples) goes from 5997 SNPs to somewhat more than 2000, while sample_A.vcf (27 samples) goes from 5997 SNPs to somewhat more than 500.
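The counting step above can be sketched as a small shell helper, together with a second count that may help separate "records present in the file" from "records that actually vary among the samples in the file". This is only a rough sketch: the function names count_records and count_variable_sites are made up for illustration and are not part of bcftools or plink, and it assumes uncompressed, tab-delimited VCFs in which GT is the first FORMAT field.

```shell
#!/usr/bin/env bash

# Count data records (non-header lines) in an uncompressed VCF.
# Equivalent to: egrep -v "^#" file.vcf | wc -l
count_records() {
    grep -vc '^#' "$1"
}

# Count records in which at least one sample genotype carries a
# non-reference allele. Assumption: GT is the first FORMAT field,
# so the text before the first ":" in each sample column is the genotype.
count_variable_sites() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            for (i = 10; i <= NF; i++) {   # sample columns start at field 10
                split($i, f, ":")          # f[1] is the GT string, e.g. 0/1
                if (f[1] ~ /[1-9]/) {      # any allele index other than 0
                    n++
                    break
                }
            }
        }
        END { print n + 0 }                # print 0 if nothing matched
    ' "$1"
}
```

For example, running count_records sample_A.vcf and count_variable_sites sample_A.vcf side by side, and then the same for sample_B.vcf, would show whether the two files share a record count while differing in how many sites vary within each sample group.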
So my first question is: if I have two different VCFs, why do they contain the same total number of SNPs? My second question is: why is there such a large difference in the number of SNPs after LD pruning between the two files? Thank you for your answers and your help.
OK, so if I understand correctly, I have to do this step for each VCF (sample_A.vcf and sample_B.vcf) generated by the view function.