I'm trying to convert a VCF of ~15K samples (~4K families) into PLINK binary format to check for any pedigree errors using KING.
The input files need to be in PLINK binary format, e.g., ex.bed, ex.fam, and ex.bim.
However, GATK3's VariantsToBinaryPed is the only tool I know of that requires, as input, a file stating the family relationships among samples a priori in order to create the above-mentioned .fam file (the first 6 columns of PLINK's .ped file).
My question is: how are all the other VCF -> PLINK conversion tools inferring relationships in order to output the .fam file of the PLINK binary fileset?
Does the information contained in the .fam file, or any pedigree/sample metadata information, actually go into the .bed file?
Because if it doesn't, then I don't really need to worry about any of this: I could just create the binary fileset from the VCF as instructed in the KING tutorial (without specifying a custom .fam), and then supply my own .fam when running KING:
king -b ex.bed --fam ex.fam --bim ex.bim --related
My sense, though, is that pedigree info is somehow encoded in the .bed file, as other tools for converting VCF to PLINK, like the aforementioned VariantsToBinaryPed from GATK, specifically require a metadata file to create the PLINK binary fileset.
And can one pass a .fam file (or similar) as input to these tools if the sample relationships are actually known (from the pedigree), as is my case?
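On the .bed question, one sanity check I could run is sketched below. It assumes the layout given in the PLINK file format documentation: a variant-major .bed is three magic bytes followed by ceil(N/4) bytes per variant, where N is the number of samples. If the file size is fully accounted for by genotypes, that would suggest there is no room for pedigree metadata in the .bed itself. The function names `expected_bed_size` and `check_bed_size` are my own, not part of any tool.

```shell
#!/usr/bin/env bash

# Expected size in bytes of a variant-major .bed file:
# 3 magic bytes + n_variants * ceil(n_samples / 4).
# (Assumption: layout as described in the PLINK format docs.)
expected_bed_size() {
    local n_samples=$1 n_variants=$2
    echo $(( 3 + n_variants * ((n_samples + 3) / 4) ))
}

# Compare the actual .bed size with what the .fam and .bim row counts imply.
# Argument: fileset prefix, e.g. "ex" for ex.bed / ex.bim / ex.fam.
check_bed_size() {
    local prefix=$1
    local n_samples n_variants actual expected
    n_samples=$(awk 'END { print NR }' "$prefix.fam")   # one row per sample
    n_variants=$(awk 'END { print NR }' "$prefix.bim")  # one row per variant
    actual=$(( $(wc -c < "$prefix.bed") ))
    expected=$(expected_bed_size "$n_samples" "$n_variants")
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $prefix.bed is $actual bytes, all accounted for by genotypes"
    else
        echo "MISMATCH: $prefix.bed is $actual bytes, expected $expected"
        return 1
    fi
}
```

Running `check_bed_size ex` on a fileset would then report whether the size matches. A match would not prove the point by itself, but it is at least consistent with the .bed holding genotypes only.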
For some reason, the number of samples in my VCF is smaller than the number in my .fam file (which includes all the samples that were sequenced). I inherited these files, so I'm not sure what the criteria were or why these ~700 samples didn't make it into the VCF.
Is there any way to tell PLINK2 to ignore this discrepancy, or do I need to go back and figure out which samples were filtered in the VCF and remove them from the .fam?
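If it does come to trimming the .fam by hand, here is a rough sketch of what I have in mind. It assumes a plain-text list of the VCF's sample IDs, one per line (something like `bcftools query -l example.vcf.gz > vcf_samples.txt` should produce that), and that those IDs match the IID in column 2 of my .fam. The file names and the two function names are hypothetical.

```shell
#!/usr/bin/env bash

# List the IIDs (column 2 of the .fam) that do not appear in the VCF sample list.
# Args: <vcf sample list, one ID per line> <.fam file>
fam_missing_from_vcf() {
    local samples=$1 fam=$2
    awk 'NR == FNR { in_vcf[$1] = 1; next }   # first file: remember VCF sample IDs
         !($2 in in_vcf) { print $2 }         # second file: report .fam IIDs not seen
        ' "$samples" "$fam"
}

# Print the .fam rows for samples present in the VCF, in VCF sample order.
# VCF samples with no row in the .fam are silently skipped.
# Args: <vcf sample list, one ID per line> <.fam file>
subset_fam_to_vcf() {
    local samples=$1 fam=$2
    awk 'NR == FNR { row[$2] = $0; next }     # first file: index .fam rows by IID
         ($1 in row) { print row[$1] }        # second file: emit rows in VCF order
        ' "$fam" "$samples"
}
```

For example, `fam_missing_from_vcf vcf_samples.txt all.fam` would list the ~700 dropped samples, and `subset_fam_to_vcf vcf_samples.txt all.fam > ex_known.fam` would write a .fam whose rows line up with the VCF's sample order. Both functions assume neither input file is empty.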
Stepping back: ultimately, my goal is to use this for pedigree checking with KING. Their tutorial states that one should first convert the VCF to PLINK binary format, and gives a command:
The VCF file of the sequence data can be easily converted into a PLINK binary format using PLINK2:
plink2 --vcf example.vcf.gz --make-bed --out ex
I understand that this command will output a PLINK binary fileset (ex.fam, ex.bed, ex.bim). However, I already have a .fam with the known relationships between samples, which is in fact what I want to check for errors with KING. Surely there must be a way to read the VCF into PLINK and take the FIDs from the supplied .fam, in order to create the binary fileset with a known .fam file?