Converting VCF to PLINK .bed binary fileset to check for pedigree errors with KING: How do conversion tools make the PLINK .fam file, without asking for family relationships a priori?
0
8
Entering edit mode
4.1 years ago
gaelgarcia ▴ 250

I'm trying to convert a VCF of ~15K samples (~4K families) into PLINK binary format to check for any pedigree errors using KING.

The input files need to be in PLINK binary format, e.g., ex.bed, ex.fam, and ex.bim.

There are multiple tools out there to convert a VCF to PLINK binary format, namely the .bed, .bim. and .fam files of the 'binary fileset'.

However, GATK3's VariantsToBinaryPed is the only tool I know of that requires as input a file stating family relationships among samples a priori, in order to create the above mentioned .fam file (the first 6 columns of PLINK's .ped file).

My question is, how are all the other VCF -> PLINK conversion tools inferring relationships in order to output the .fam file of the PLINK binary fileset?

Does the information contained in the .fam file, or any pedigree/sample metadata information, actually go into the .bed file?

Because if it doesn't... then I don't really need to worry about all of this - I could just create the .binary fileset from the VCF as instructed in the KING tutorial (without specifying a custom .fam), and then just supply my own .fam when running KING:

king -b ex.bed --fam ex.fam --bim ex.bim --related


My sense, though, is that pedigree info is somehow encoded in the .bed file, though, as other tools for converting VCF to PLINK, like the aforementioned VariantsToBinaryPed from GATK, specifically require a metadata file to create the PLINK binary fileset.

And, can one pass as input a .fam file (or similar) to these tools if the sample relationships are actually known (from the pedigree), as is my case?

Thanks!

UPDATE:

For some reason, the number of samples in my VCF is smaller than those in my .fam file (which includes all the samples that were sequenced.) I inherited these files, so I'm not sure what the criteria was or why these ~700 samples didn't make it into the VCF.

Is there any way to tell PLINK2 to ignore this discrepancy, or do I need to go back and figure out which samples were filtered in the VCF and remove them from the .fam?

UPDATE 2

Stepping back - ultimately, my goal is to use this for pedigree checking using KING - In their tutorial, it is stated that one should first convert the VCF to PLINK binary format, and a command is given:

The VCF file of the sequence data can be easily converted into a PLINK binary format using PLINK2:

plink2 --vcf example.vcf.gz --make-bed --out ex


I understand that this command will output a PLINK binary fileset (ex.fam , ex.bed, ex.bim) -- however, I already have a .fam with the known relationships between samples, which is in fact what I want to check for errors with KING. Surely there must be a way to read the VCF into PLINK, and take the FIDs from the supplied .fam, in order to create the binary fileset with a known .fam file?

Thanks again.

plink vcf fam gatk vcftools • 30k views
1
Entering edit mode

plink 2.0 lets you use --fam with --vcf, and verifies that the sample IDs are consistent before proceeding.

1
Entering edit mode

Hello again, I guess I'm still confused about the right way to go about this -- have not been able to get past the VCF/.fam mismatch error.

I'm not sure that the options suggested for interpreting the VCF sampleIDs will help my particular case (--double-id, --id-delim, --const-fid) - The trouble is that the SampleIDs in the VCF are an exact match to the IID column of the .fam file.

I've triple-checked that these two lists contain the same elements, and I'm passing the --indiv-sort fileparameter, with a sort file that is an exact copy of the first two columns (FID and IID) of the .fam file, so the VCF sampleIDs should indeed be read by PLINK in the exact same order as the .fam IIDs.

My VCF does not contain FIDs in the SampleIDs (just the IIDs), but the FIDs are stated in the .fam -- surely there must be a way to read in the VCF and take the FIDs from the supplied .fam to create the binary fileset?

Stepping back - ultimately, my goal is to use this for KING - In their tutorial, it is stated that one should first convert the VCF to PLINK binary format, and a command is given:

The VCF file of the sequence data can be easily converted into a PLINK binary format using PLINK2:

> plink2 --vcf example.vcf.gz --make-bed --out ex


I understand that this command will output a PLINK binary fileset (ex.fam , ex.bed, ex.bim) -- however, I already have a .fam with the known relationships between samples, which is in fact what I want to check for errors with KING.

Apologies if I'm being too silly - I think I'm just very confused at this point!

1
Entering edit mode

My VCF does not contain FIDs in the SampleIDs (just the IIDs), but the FIDs are stated in the .fam -- surely there must be a way to read in the VCF and take the FIDs from the supplied .fam to create the binary fileset?

Then you will have to code the sample IDs in your VCF as FID_IID, and use --id-delim _ when converting to plink. Data conversion steps are rarely streamlined and form a large part of our work, at least from what I can see.

Yes, the developers of KING obviously didn't realise the glaring issue of losing family relationships when just using plink2 --vcf

1
Entering edit mode

Does the information contained in the .fam file, or any pedigree/sample metadata information, actually go into the .bed file?

Because if it doesn't... then I don't really need to worry about all of this - I could just create the .binary fileset from the VCF as instructed in the KING tutorial (without specifying a custom .fam), and then just supply my own .fam when running KING:

king -b ex.bed --fam ex.fam --bim ex.bim --related


My sense, though, is that pedigree info is somehow encoded in the .bed file, though, as other tools for converting VCF to PLINK, like the aforementioned VariantsToBinaryPed from GATK, specifically require a metadata file to create the PLINK binary fileset.

1
Entering edit mode

No, the .bed file does not contain any pedigree/sample information; it's nothing but a matrix of {0, 1, 2, missing} allele count values. (Even the matrix dimensions aren't in the file!)

0
Entering edit mode

Ahh, that explains so much! Thanks!

0
Entering edit mode

You can use plink 1.9 —update-ids + —make-bed to fill in the initially-missing FIDs (https://www.cog-genomics.org/plink/1.9/data#update_indiv ). This requires you to create a four-column file describing how you want to change the IDs.

0
Entering edit mode

Thank you! I do with all these options were clearly listed in the documentation... 😔

0
Entering edit mode

@chrchang: So in this case, if I run :

./plink2 --make-bed --vcf ~/path/to/mips.vcf.bgz --fam ~/path/to/mips_pedigree.fam --out mips ,

then (if I understand correctly), --make-bed will output a mips.fam file identical to the .fam file passed with --fam?

2
Entering edit mode

If you do not specify any FAM file when you are converting from VCF/BCF to plink format, then plink will just create a 'dummy' FAM file with the same name as your dataset. It will look something like this:

0 16367 0 0 0 -9
0 16407 0 0 0 -9
0 16402 0 0 0 -9
0 16382 0 0 0 -9
0 16392 0 0 0 -9
0 16362 0 0 0 -9
0 16372 0 0 0 -9
0 16377 0 0 0 -9
0 16397 0 0 0 -9
0 16387 0 0 0 -9


When you read your data into plink, I strongly encourage you to make the use of a 'sort file'. For example:

plink --noweb --bcf MyData.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _  --id-delim _ --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyData ;


It does not matter how your samples are positioned in the BCF/VCF file - the use of the --indiv-sort file parameter instructs plink to read samples into the plink object in that order. The file IDsort.list contains (FID IID):

GreekFam1   16367
GreekFam1   16407
GreekFam2   16402
GreekFam2   16382
GreekFam3   16392
GreekFam3   16362
GreekFam4   16372
GreekFam4   16377
GreekFam5   16397
GreekFam5   16387
GreekFam6   16396
GreekFam6   16366
GreekFam7   16405
GreekFam7   16400
GreekFam8   16410
GreekFam8   16376
GreekFam9   16381
GreekFam9   16401
GreekFam9   16380
GreekFam9   16386
GreekFam10  16411
GreekFam10  16412
GreekFam10  16375
GreekFam11  16390
GreekFam11  16406
GreekFam11  16395
GreekFam11  16417
GreekFam12  16385
GreekFam12  16391


Why is this important? It's important because, without specifying the order, plink sometimes inputs the samples seemingly randomly, and may not match the order in your custom FAM file. So, when you now want to create your FAM file, ensure 100% that the sample order in the FAM file matches the order in the IDsort.list file.

Note also the --id-delim _ parameter. This instructs plink that, in my VCF/BCF, the FID and IID are split by an underscore. So, in your VCF/BCF, you should rename your sample IDs to something like:

GreekFam1_16367
GreekFam1_16407
GreekFam2_16402
GreekFam2_16382
et cetera


You can do this with bcftools reheader

## ----------------------------------------------

After that, use the --fam parameter in every command that you run with plink. For example:

plink --noweb --bfile MyData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam --out MyDataTDT ;

0
Entering edit mode

Thank you for this extremely valuable heads-up!

0
Entering edit mode

Kevin - Do the sample IDs in the VCF have to include both FID and IID?

My VCF Sample IDs consist of only the IID -- I'm still getting an error message after filtering the .fam file to contain only samples found in the VCF, in addition to specifying the order with an IDsort file.

./plink2 --vcf norm.vcf.bgz  --fam targeted_seq_subsetVCF.fam -indiv-sort file IDsort.list --make-bed --out mips-take2


Where IDsort.list is the first two columns of targeted_seq_subsetVCF.fam.

1
Entering edit mode

No, you can leave it as just the sample ID on its own - sorry not to make that clear. However, if you then try to use a custom FAM with FIDs set, then plink will throw an error / not function as expected. In most of my work, I had no family data, so, in my custom FAM the FIDs were all zeros.

As chrchang523 mentions you can then later update the FIDs later within plink with --update-parents.

Having to change IDs in a VCF is indeed a nuisance.

0
Entering edit mode

Hi, would like to confirm, for FID which are zeros, it would not be considered as one family in relationship inference? I was worried the same FID might lead samples being regarded as same family, and wonder if we need unique FID for all?

1
Entering edit mode

The command may error out if the sample IDs, or their order, aren’t identical between the .fam and the .vcf; you may need to use —double-id/—id-delim/—const-fid to change how the VCF IDs are interpreted. But if it works, the output .fam should be the same.

0
Entering edit mode

Thanks @Kevin and @chrchang -

For some reason, the number of samples in my VCF is smaller than those in my .fam file (which includes all the samples that were sequenced.) I inherited these files, so I'm not sure what the criteria was or why these ~700 samples didn't make it into the VCF.

Is there any way to tell PLINK2 to ignore this discrepancy, or do I need to go back and figure out which samples were filtered in the VCF and remove them from the .fam?

3
Entering edit mode

No, you can't tell plink2 to ignore the discrepancy, but in that case it's pretty straightforward to just import the VCF without pedigree info, and then patch in parental/sex(/phenotype?) information afterward with plink 1.9 --update-parents/--update-sex/--pheno.

1
Entering edit mode

Is it giving an error due to that discrepancy? Perhaps chrchang523 knows of a specific parameter that can be used.

Personally, I would indeed ensure that missing samples were not included in the FAM. You could output the row in your VCF that has the sample names and then convert them into a list format (plain text). Then use grep -f in oder to filter only those samples from the FAM.

Actually, I jst tried that and it seems that it would be a quick solution. Here's how you could get your list of samples in the VCF:

bcftools view 1000Genomes.Norm.bcf | head -100 | awk '/#CHROM/ {print}' | sed 's/\t/\n/g' | head -30
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
HG00096
HG00097
HG00099
HG00100
HG00101
HG00102
HG00103
HG00105
HG00106
HG00107
HG00108
HG00109
HG00110
HG00111
HG00112
HG00113
HG00114
HG00115
HG00116
HG00117
HG00118


Obviously then just remove the unnecessary first few rows (#CHROM POS, etc)). After that, use grep -f to filter the FAM

1
Entering edit mode

Thank you, Kevin! I'll give this a try. In fact, it does return an error because of this discrepancy:

    Error: Mismatched IDs between --vcf file and
/Users/gaelgarcia/targeted_seq.fam.


I'm pretty sure this is because of the missing samples in the VCF - the IIDs are identical between .fam and VCF.

Just for the sake of saving other newbies some time, here's how I extracted the sampleIDs from the VCF on Mac:

zcat < norm.vcf.bgz | head -100 | awk '/#CHROM/ {print}' | sed 's/   /\'$'\n/g' > vcf_sampleIDs.txt  And then subset the original .fam to filter out samples NOT found in the VCF. grep -f vcf_sampleIDs.txt targeted_seq.fam > targeted_seq_subsetVCF.fam  Conversely, to get the sampleIDs missing from the VCF: grep -v -f vcf_sampleIDs.txt targeted_seq.fam > targeted_seq_filtered-fromVCF.fam  ADD REPLY 0 Entering edit mode I just realized there's a problem with this logic -- the command gets every line from the .fam file where's a match to the sampleIDs extracted above. This means that samples that are not in the VCF, but, for example, their parents are, will have a match, and still be included. I reckon I should restrict grep to look for matches only in the 2nd column of the .famfile. UPDATE: This seems to do the trick - awk 'BEGIN { while(getline <"vcf_sampleIDs.txt") id[$0]=1; }
id[\$2] ' targeted_seq.fam

1
Entering edit mode

Could could try awk instead. Match on the IID column and, if there's a match, print the entire line