I'm using plink's linkage disequilibrium logic to filter my variants. The team I'm working with needs everything to be kept as VCF files, but one problem is that the original VCF files don't contain any variant IDs, so when I later use plink's --extract command I end up with no variants. I tried using plink's --recover-var-ids arg to hopefully give a hint to plink that it needs to parse the --extract variant IDs, but plink complains that --recover-var-ids is not a recognized flag. My commands look like this:
plink \
  --vcf $INPUT_FILE_PATH \
  --indep $LD_WINDOW_SIZE_KB $LD_STEP_SIZE $VIF_THRESHOLD \
  --out $LD_FILE_PATH \
  --set-missing-var-ids @:#[b38]\$1,\$2 \
  --allow-extra-chr

plink \
  --vcf /inputs/data.vcf.gz \
  --extract ${LD_FILE_PATH}.prune.in \
  --out /outputs/data.vcf.gz \
  --recover-var-ids \
  --recode vcf
Would this not be a problem if I used plink's --bfile instead of VCF files? I suppose I could just add another step to convert the bed files back to VCFs.
--recover-var-ids, and decent VCF re-export capability, require plink 2.0. (plink 1.9 doesn't even keep REF/ALT allele order straight by default, because there was no way to do so without breaking compatibility with plink 1.07.)

You don't need --recover-var-ids here. (When you do, it is necessary to also provide a file with the IDs; see https://www.cog-genomics.org/plink/2.0/data#recover_var_ids for details.) Instead, you have the right idea with --bfile, except that you probably want to use --pfile/--make-pgen instead (that format is capable of preserving many more types of information in the VCF). Alternatively, you could include the same --set-missing-var-ids template in the second command that you did in the first.
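
For the last option, here is a rough sketch of both steps sharing one ID template, using plink 1.9 and only flags that already appear in the question (minus --recover-var-ids). The function names and the ID_TEMPLATE variable are made up for illustration, and adding --allow-extra-chr to the second step is an assumption based on the first command needing it.

```shell
# Same template as the first command in the question. Single quotes keep
# $1 and $2 literal, equivalent to the \$1,\$2 escaping used there.
ID_TEMPLATE='@:#[b38]$1,$2'

# Step 1 (hypothetical helper): LD pruning. Writes <out_prefix>.prune.in,
# which lists the IDs generated from the template.
# Args: input VCF, output prefix, window size, step size, VIF threshold.
ld_prune() {
  plink --vcf "$1" --allow-extra-chr \
    --set-missing-var-ids "$ID_TEMPLATE" \
    --indep "$3" "$4" "$5" \
    --out "$2"
}

# Step 2 (hypothetical helper): re-read the VCF, assign the same IDs again
# so that --extract has something to match, then export as VCF.
# Args: input VCF, prune.in file, output prefix (plink appends .vcf).
extract_pruned() {
  plink --vcf "$1" --allow-extra-chr \
    --set-missing-var-ids "$ID_TEMPLATE" \
    --extract "$2" \
    --recode vcf \
    --out "$3"
}
```

With the variables from the question this would be called as ld_prune "$INPUT_FILE_PATH" "$LD_FILE_PATH" "$LD_WINDOW_SIZE_KB" "$LD_STEP_SIZE" "$VIF_THRESHOLD", followed by extract_pruned "$INPUT_FILE_PATH" "${LD_FILE_PATH}.prune.in" /outputs/data. Both steps need to read the same input VCF, otherwise the generated IDs may not line up. The REF/ALT caveat for plink 1.9 mentioned above still applies to the exported file.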