Selecting a Subset of Samples Based on Genotype Variety of Multiple Variants
4.1 years ago
maegsul ▴ 170

Hi, I have processed a VCF of ~1000 samples with ~30 variants of interest (SNPs from GWAS). I converted the genotype values to 0, 1 and 2 based on the number of risk alleles. The result is a tab-delimited table like the one below:

sample  rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 . . . . . . rs30
sample1 0   0   2   1   1   0   1   2
sample2 1   0   1   1   1   0   1   1
sample3 0   0   2   2   1   0   0   2
sample4 0   0   1   1   0   0   0   1
sample5 1   0   1   1   1   0   0   1
sample6 0   0   0   2   0   0   1   1
sample7 0   0   1   0   2   0   0   2
sample8 0   0   1   1   0   0   2   0
sample9 1   0   1   1   1   0   0   1
.
.
.
.
sample1000
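
For reference, the 0/1/2 coding used in the table can be sketched roughly as follows. This is only an illustration, assuming biallelic diploid `GT` strings such as `0/1` or `1|1`; the function name `risk_allele_count` and the `risk_is_alt` flag are hypothetical and not taken from any library.

```python
import re

def risk_allele_count(gt, risk_is_alt=True):
    """Return the number of risk alleles (0, 1 or 2) for a biallelic diploid GT.

    Assumption: `gt` looks like "0/1" or "1|0". Missing or multi-allelic
    calls (e.g. "./." or "1/2") return None instead of a count.
    """
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    alt_count = sum(a == "1" for a in alleles)
    # If the reference allele is the risk allele, flip the count.
    return alt_count if risk_is_alt else 2 - alt_count
```

Whether the risk allele is REF or ALT has to be looked up per variant (for example from the GWAS summary statistics), so `risk_is_alt` would be set separately for each rsID.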

I am looking for a way to randomly choose a group of samples with enough genotype variety for a follow-up experiment. For instance, I would like to print 11 lines (= samples) that satisfy the following criteria for each rsID, as closely as possible: 3 samples with genotype 0, 5 with genotype 1, and 3 with genotype 2.

Is there an easy way to do this? Thanks in advance!
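
One possible approach, sketched here under some assumptions, is a random start followed by swap-based hill climbing: pick 11 samples at random, then repeatedly try swapping one chosen sample for an unchosen one and keep the swap if the genotype counts get no further from the 3/5/3 target. The names `TARGET`, `cost` and `select_samples` are made up for this sketch, `rows` is assumed to be a list of per-sample lists of integers 0/1/2 with no missing values, and the result is a heuristic, not a guaranteed optimum. An exact 3/5/3 split may not exist for every variant (in the rows shown, rs2 and rs6 are all 0).

```python
import random

TARGET = (3, 5, 3)  # desired number of samples with genotype 0, 1, 2 per variant

def cost(rows, chosen, target=TARGET):
    """Sum over variants of the absolute deviation from the target counts."""
    n_var = len(rows[chosen[0]])
    total = 0
    for j in range(n_var):
        counts = [0, 0, 0]
        for i in chosen:
            counts[rows[i][j]] += 1
        total += sum(abs(c - t) for c, t in zip(counts, target))
    return total

def select_samples(rows, k=11, target=TARGET, n_iter=5000, seed=0):
    """Random start plus swap-based hill climbing.

    Returns (sorted row indices, final cost). A cost of 0 means every
    variant matches the target counts exactly.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(rows)), k)
    best = cost(rows, chosen, target)
    for _ in range(n_iter):
        if best == 0:
            break
        pos = rng.randrange(k)           # position in the subset to replace
        cand = rng.randrange(len(rows))  # candidate replacement sample
        if cand in chosen:
            continue
        trial = chosen[:pos] + [cand] + chosen[pos + 1:]
        c = cost(rows, trial, target)
        if c <= best:                    # accept improvements and ties
            chosen, best = trial, c
    return sorted(chosen), best
```

To use it on the table above, each line after the header would be split on tabs, the sample name kept in a separate list, and the remaining fields converted with `int`. Running `select_samples` with several different `seed` values and keeping the lowest-cost result gives different random subsets and reduces the chance of getting stuck in a poor local optimum.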

SNP • 532 views