How to make a small subset of a large genotyping dataset with plink?
2.9 years ago
kynnjo ▴ 40

I have a collection of genotyping files that, for the purpose of this question, I will call big.bed, big.ped, big.fam, big.vcf, etc. This dataset has information on ~1.8M SNPs and 877 samples.

I also have a list of ~1000 SNPs in a file wanted_snps.txt, one SNP per line.

I want to generate a collection of files tiny.bed, tiny.ped, tiny.fam, tiny.vcf, etc., containing only the subset of the data in the big.* files that corresponds to the SNPs listed in wanted_snps.txt.

(In case it matters, we can safely assume that all the SNPs mentioned in wanted_snps.txt are represented in the big.* dataset.)

I understand that one can perform such subsetting using plink, but after poring over the online documentation, I still can't figure out how to do this.

Could someone show me what commands I'd need to run to do this?

I am using plink version 1.9.

Thanks in advance!

2.9 years ago


If I am not mistaken, you can use plink's --extract option to do this.

To extract only a subset of SNPs, it is possible to specify a list of required SNPs and make a new file, or perform an analysis on this subset, by using the command

plink --file data --extract mysnps.txt

where the file is just a list of SNP IDs, one per line.
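Adapted to the file names in your question, a minimal sketch for plink 1.9 might look like the following. It assumes the binary fileset big.bed/big.bim/big.fam is in the current directory and that the IDs in wanted_snps.txt match the variant IDs in big.bim. The `run` helper is a hypothetical dry-run wrapper of my own (not part of plink): it only prints the command when plink or the input files cannot be found.

```shell
#!/bin/sh
# Sketch: subset a plink binary fileset to the SNPs listed in a text file.
# File names below are taken from the question; adjust as needed.
IN=big                  # prefix of big.bed / big.bim / big.fam
OUT=tiny                # prefix for the output files
SNPS=wanted_snps.txt    # one SNP ID per line

# Hypothetical helper: run the command only if plink and the input exist,
# otherwise just print what would have been run.
run() {
    if command -v plink >/dev/null 2>&1 && [ -f "$IN.bed" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Binary fileset: tiny.bed, tiny.bim, tiny.fam
run plink --bfile "$IN" --extract "$SNPS" --make-bed --out "$OUT"

# Text fileset: tiny.ped, tiny.map
run plink --bfile "$IN" --extract "$SNPS" --recode --out "$OUT"

# VCF: tiny.vcf
run plink --bfile "$IN" --extract "$SNPS" --recode vcf --out "$OUT"
```

Each call reads the same input and differs only in the output format, so you can keep just the ones you need. If your input is a .ped/.map pair rather than a binary fileset, --file big should take the place of --bfile big.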

Hope this solves your query.


