Plink altering VCF IDs
3.7 years ago
dec986 ▴ 370

I'm running the following plink command on a VCF file:

plink --vcf chrX.M.vcf.gz --set-hh-missing --output-chr M --recode vcf --out chrX.M.set_hh_missing

but the sample IDs in the output VCF are doubled. For example:

ABCDEF becomes ABCDEF_ABCDEF

After reading the documentation, none of these options looks like it would do this. Why is this happening, and how can I prevent the sample IDs from being doubled?

plink • 1.6k views
3.7 years ago

plink 1.x requires two-part sample IDs, so there are some rough edges re: import/export of single-part sample IDs. See https://www.cog-genomics.org/plink/1.9/input#double_id and https://www.cog-genomics.org/plink/1.9/data#recode for details.

Two workarounds:

  • Add "--const-fid 0 --keep-allele-order", and replace "--recode vcf" with "--recode vcf-iid" in your command line.
  • Use plink 2.0 when working with VCFs. In addition to natively supporting single-part sample IDs, it preserves VCF header lines and QUAL/FILTER/INFO columns, does not automatically swap REF/ALT alleles on you (this is what --keep-allele-order in the first workaround counteracts), and can handle multiallelic variants, phase, and dosage data.

Also note that you can add 'bgz' to --recode ("--recode vcf-iid bgz") to request bgzipping of the VCF.
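Putting the first workaround together with the bgz modifier, the command from the question would look roughly like the sketch below. The file names and the other flags are copied from the question. The variable names and the "run only if plink and the input are present" guard are my own additions for illustration, not anything plink requires.

```shell
# Input and output names as given in the question.
in_vcf="chrX.M.vcf.gz"
out_prefix="chrX.M.set_hh_missing"

# --const-fid 0         give every sample the family ID 0 instead of doubling the sample ID
# --keep-allele-order   keep plink 1.9 from swapping REF/ALT alleles
# --recode vcf-iid bgz  write only the within-family ID as the VCF sample ID, bgzipped
plink_args="--vcf $in_vcf --const-fid 0 --keep-allele-order --set-hh-missing --output-chr M --recode vcf-iid bgz --out $out_prefix"

# Illustrative guard: run plink only when it is installed and the input exists;
# otherwise just print the command that would be run.
if command -v plink >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    # Unquoted on purpose so the string splits into separate arguments
    # (this assumes the file names contain no spaces).
    plink $plink_args
else
    echo "would run: plink $plink_args"
fi
```

With this, ABCDEF should stay ABCDEF in the output header rather than becoming ABCDEF_ABCDEF.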
