How to use plink to consolidate multiple single-sample vcf files?
19 months ago
kynnjo ▴ 40

I have a few hundred single-sample vcf files that I need to consolidate into a single vcf file.

My understanding is that plink is one of the many tools that one can use for doing this. I would prefer to use plink if possible, even if it is not the best one, because I am already using plink for other operations, and I want to keep my toolchain as small as possible.

Unfortunately, I am having a hard time locating the relevant details for this task in the plink documentation.

Could someone kindly either post the plink command-line to perform such consolidation, or else a link to the relevant section(s) in the plink documentation?

EDIT: I am using plink version 1.90b, for backward-compatibility with the rest of the project I am working on.

(If the question above was clear enough to you, you can skip the rest of this post.)

Each single-sample vcf file consists of a comment/header section (each line beginning with ##), followed by a single row of tab-separated column headers (which begins with a single #), followed by ~1.7M tab-separated rows of metadata and data.

The comment/header section is identical across all these files.

The rows in all these files correspond to the set of SNPs and other variants that are probed by the same genotyping chip. In other words, the rows of all these files are consistent with each other.

The rows section of each of these files consists of 10 tab-separated columns, the first 9 of which hold metadata and are identical across all the files. Only the 10th column (including its column header) contains sample-specific data, and therefore differs across the files.

Accordingly, the first 9 column headers are identical across these files, while the 10th column header is a sample-specific identifier, and hence unique to each file.

Therefore, "consolidation" here means producing a file in which the leading header/comment section and the first 9 columns are identical to those of any of the single-sample vcf files, and the remaining (tab-separated) columns correspond to the 10th columns of all the single-sample vcf files. Conceptually, this is a relatively simple operation. The problem is to do it efficiently.
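For illustration only, the operation described above can be sketched with standard Unix tools rather than plink. The function name `paste_vcfs` and all file names are hypothetical, and the sketch assumes exactly what the post states: every file has the same rows in the same order, with the sample data in column 10. Nothing in it checks that assumption.

```shell
# Sketch: naive column-wise consolidation of single-sample VCFs.
# Usage: paste_vcfs OUT.vcf IN1.vcf IN2.vcf ...
paste_vcfs() (
    set -eu
    out=$1; shift
    first=$1
    tmpdir=$(mktemp -d)

    # Shared '##' header lines, copied from the first input file.
    grep '^##' "$first" > "$out" || true

    # First 9 columns (the '#CHROM' line plus data rows) from the first file.
    grep -v '^##' "$first" | cut -f1-9 > "$tmpdir/fixed.tsv"

    # Column 10 (sample ID header plus genotypes) from every input file.
    # Zero-padded names keep the glob below in argument order.
    i=0
    for vcf in "$@"; do
        i=$((i + 1))
        grep -v '^##' "$vcf" | cut -f10 > "$tmpdir/$(printf 'sample_%06d.tsv' "$i")"
    done

    # Glue the fixed columns and all sample columns together, tab-separated.
    paste "$tmpdir/fixed.tsv" "$tmpdir"/sample_*.tsv >> "$out"
    rm -r "$tmpdir"
)
```

A call would look like `paste_vcfs all.vcf sample1.vcf sample2.vcf`. With a few hundred inputs, `paste` opens every sample file at once, so the per-process open-file limit may need checking.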

genotyping vcf plink SNP • 462 views
19 months ago
1. Convert each VCF to plink binary format.
2. Use --merge-list for the actual merge (https://www.cog-genomics.org/plink/1.9/data#merge_list)
3. Export the result as a VCF. You’ll probably need to use --a2-allele to keep REF/ALT alleles straight.
4. Use e.g. a short shell script to add the header lines back to the final result.
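Step 4 could look something like the sketch below. The function name `restore_header` and the file names are hypothetical. It assumes the `##` lines of any one input VCF are the header you want, and that the `##` lines plink writes can be discarded.

```shell
# Sketch: put the original '##' header lines on top of a plink-exported VCF.
# Usage: restore_header ORIGINAL.vcf PLINK_EXPORT.vcf FINAL.vcf
restore_header() (
    set -eu
    original=$1   # any one of the single-sample input VCFs
    exported=$2   # VCF written by plink in step 3
    out=$3

    # Header/comment lines from the original file.
    grep '^##' "$original" > "$out" || true

    # '#CHROM' column-header line and data rows from plink's output,
    # dropping plink's own '##' lines.
    grep -v '^##' "$exported" >> "$out"
)
```

Since plink exports genotype calls only, original header lines that describe other FORMAT or INFO fields may no longer apply to the result and may need pruning.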

Thank you. Just from your answer I can guess one reason for my failure to find what I was looking for: I thought this would be a single-command operation.

In my original post I neglected to mention that I am using plink version 1.90b, for backward-compatibility with the rest of the project I am working on. I hope that your recipe works with this version of plink too. (I have now edited my query to fix this omission.)

I have not yet been able to perform the first step (the conversion of the vcf files to plink binary format). I tried a command of the form

    plink --file /path/to/my.vcf --out /path/to/my.plink


...but plink fails immediately with an error about not being able to find /path/to/my.vcf.map.

Would you mind posting example command lines for steps 1-3 of your procedure? (I think I have a handle on step 4.)

    plink --vcf input1.vcf --out converted1
    ...
    plink2 --bfile merged --ref-from-fa reference.fa --export vcf --out result


(This shows an alternative approach for step 3 which requires plink 2.0. As mentioned in my original answer, if you have a single file with all the correct reference alleles, you can use plink 1.9 --a2-allele instead.)
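For a plink 1.9-only route, steps 1-3 might be strung together roughly as follows. This is a sketch under assumptions, not the exact commands elided above. The function name `consolidate_with_plink`, the `PLINK` variable and all file names are hypothetical. `--double-id` and `--recode vcf-iid` are additions meant to keep sample IDs unchanged. Every variant is assumed to have a unique, non-missing ID in column 3, because both the merge and `--a2-allele` match on it. One of the input VCFs doubles as the "single file with all the correct reference alleles".

```shell
# Sketch of steps 1-3 using plink 1.9 only.
# Usage: consolidate_with_plink OUT_PREFIX IN1.vcf IN2.vcf ...
PLINK=${PLINK:-plink}   # hypothetical override hook; defaults to 'plink' on PATH

consolidate_with_plink() (
    set -eu
    out=$1; shift   # output prefix; the remaining arguments are the VCFs
    ref_vcf=$1      # first input VCF, reused as the REF allele source
    : > "$out.mergelist.txt"

    # Step 1: one binary fileset (.bed/.bim/.fam) per single-sample VCF.
    for vcf in "$@"; do
        prefix=${vcf%.vcf}
        "$PLINK" --vcf "$vcf" --double-id --make-bed --out "$prefix"
        echo "$prefix" >> "$out.mergelist.txt"
    done

    # Step 2: merge all filesets named in the list (one prefix per line).
    "$PLINK" --merge-list "$out.mergelist.txt" --make-bed --out "$out.merged"

    # Step 3: export as VCF. --a2-allele sets A2 (written as REF) from
    # column 4 of the reference VCF, matched on the variant ID in column 3,
    # skipping lines that start with '#'.
    "$PLINK" --bfile "$out.merged" \
        --a2-allele "$ref_vcf" 4 3 '#' \
        --recode vcf-iid --out "$out"
)
```

Called as `consolidate_with_plink all sample1.vcf sample2.vcf`, this would be expected to leave the merged result in `all.vcf`, with the intermediate filesets written next to the inputs.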