10 weeks ago
I have a set of pVCF files that I intend to slice.
For example, I want to take 100 samples and 200 variants from the file. How can I do this?
All the tutorials I have come across require using a "list" of sample names and list of variants. But can I just do this randomly?
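One way to avoid writing a sample list by hand is to generate it randomly from the VCF header. The sketch below is only an illustration under stated assumptions: toy.vcf and random_samples.txt are hypothetical file names, the toy file has 5 samples and the sketch picks 2 (rather than 100), and GNU shuf is assumed to be installed.

```shell
# Hypothetical toy VCF: 9 fixed columns followed by 5 sample columns.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\tS5\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/0\t0/1\n'
} > toy.vcf

# Sample names are columns 10 onward of the #CHROM header line.
# Print them one per line, then keep 2 at random.
# shuf is acceptable here because the sample list is small.
grep -m1 '^#CHROM' toy.vcf | cut -f10- | tr '\t' '\n' | shuf -n 2 > random_samples.txt

# Assumption: with bcftools installed, the random list could then be used as
#   bcftools view -S random_samples.txt toy.vcf
```

The resulting file is an ordinary "list" of sample names, so it can be fed to any of the tutorials that expect one.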
This is very helpful, and exactly what I was looking for. Is there any chance the subsetting of variants could be done with a known tool such as gatk SelectVariants, so as to avoid using awk, sort, etc.? Or would you say that the awk+sort approach is always safe?
They have been around for 50 years. I think they're safe.
Pierre Lindenbaum Thanks again for this excellent answer! I've tested this extensively and it does exactly what I need, though I'm still struggling a bit to understand some parts of it and the rationale behind it. For example, why does awk '{printf("%f\t%s\n",rand(),$0);}' print the same numbers in the first column every time I run it on the same file? I understand shuf is a bad solution for choosing variants because it loads the entire file into memory, but could sort -R or using echo $line $RANDOM achieve the same result or be faster? Currently downsampling my 30GB+ VCF file to 50 variants takes about 1 hour.
'{printf("%f\t%s\n",rand(),$0);}' prepends a random number to each VCF line.
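On the repeated numbers: in gawk and other awk implementations that follow the traditional behaviour, rand() starts from the same seed on every run unless srand() is called first, so the first column is reproducible by design. A minimal sketch of the decorate, sort, undecorate idea with an explicit seed follows; body.txt, pick_a.txt, pick_b.txt and the helper sample_lines are hypothetical names, and the toy input stands in for the non-header lines of a VCF.

```shell
# Hypothetical toy input standing in for the body (non-header lines) of a VCF.
printf 'var%d\n' 1 2 3 4 5 6 7 8 9 10 > body.txt

# Prefix each line with a random key, sort numerically on that key,
# keep the first 3 lines, then strip the key again.
# srand(seed) sets the starting point of rand(); a different seed gives a
# different sample, and the same seed reproduces the same sample.
sample_lines() {
  awk -v seed="$1" 'BEGIN { srand(seed) } { printf("%f\t%s\n", rand(), $0) }' body.txt |
    sort -t "$(printf '\t')" -k1,1n |
    head -n 3 |
    cut -f2-
}

sample_lines 42 > pick_a.txt   # same seed ...
sample_lines 42 > pick_b.txt   # ... same selection
```

Passing something that changes between runs as the seed (for example "$RANDOM" in bash, or calling srand() with no argument to use the time of day) should give a different selection each time.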