Question

How to make a small subset of a large genotyping dataset with plink?

5.4 years ago

kynnjo ▴ 70

I have a collection of genotyping files that, for the purpose of this question, I will call big.bed, big.ped, big.fam, big.map, big.vcf, etc. This dataset has information on ~1.8M SNPs and 877 samples.

I also have a list of ~1000 SNPs in a file wanted_snps.txt, one SNP per line.

I want to generate a collection of files tiny.bed, tiny.ped, tiny.fam, tiny.map, tiny.vcf consisting of the subsets of the data in the big.* files corresponding to the SNPs mentioned in wanted_snps.txt.

(In case it matters, we can safely assume that all the SNPs mentioned in wanted_snps.txt are represented in the big.* dataset.)

I understand that one can perform such subsetting using plink, but after poring over the online documentation, I still can't figure out how to do this.

Could someone show me what commands I'd need to run to do this?

I am using plink version 1.9.

Thanks in advance!

Updated 5.4 years ago by Inquisitive8995 ▴ 270 • written 5.4 years ago by kynnjo ▴ 70

score 1 · Answer 1 · 2018-11-14

Hello,

If I am not wrong, you can use the --extract flag in plink to do this.

To extract only a subset of SNPs, it is possible to specify a list of required SNPs and make a new file, or perform an analysis on this subset, by using the command

plink --file data --extract mysnps.txt

where the file is just a list of SNPs, one per line, e.g.
snp005
snp008
snp101

http://zzz.bwh.harvard.edu/plink/dataman.shtml#extract
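
Since you want new files written out, --extract has to be combined with an output flag such as --make-bed or --recode. Here is a rough sketch for plink 1.9 using the file names from your question. The function name subset_snps is just something I made up for illustration, and I am assuming that big.bed comes with matching big.bim and big.fam files and that the IDs in wanted_snps.txt match the variant IDs in big.bim.

```shell
# Sketch only: subset a plink 1.9 binary fileset to the SNPs listed in a file.
# Usage: subset_snps <input_prefix> <snp_list> <output_prefix>
# (subset_snps is a hypothetical helper name, not part of plink.)
subset_snps() {
    in_prefix=$1    # e.g. big  (assumes big.bed, big.bim, big.fam exist)
    snp_list=$2     # e.g. wanted_snps.txt, one SNP ID per line
    out_prefix=$3   # e.g. tiny

    # Keep only the listed SNPs and write tiny.bed, tiny.bim, tiny.fam
    plink --bfile "$in_prefix" --extract "$snp_list" --make-bed --out "$out_prefix" &&
    # Convert the small binary fileset to text format: tiny.ped, tiny.map
    plink --bfile "$out_prefix" --recode --out "$out_prefix" &&
    # Convert the small binary fileset to VCF: tiny.vcf
    plink --bfile "$out_prefix" --recode vcf --out "$out_prefix"
}
```

You would call it as: subset_snps big wanted_snps.txt tiny. Each plink run writes its log to tiny.log, so only the log of the last step is kept; use different --out prefixes if you want all three logs.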

Hope this solves your query.