Randomly select a number of variants and samples from a multi-sample VCF file (pVCF)
15 months ago
' ▴ 300

I have a set of pVCF files that I intend to slice.

For example, I want to take 100 samples and 200 variants from the file. How can I do this?

All the tutorials I have come across require using a "list" of sample names and list of variants. But can I just do this randomly?

vcf pvcf • 2.2k views
15 months ago

All the tutorials I have come across require using a "list" of sample names and list of variants. But can I just do this randomly?

To get a random list of 100 samples:

bcftools query -l in.vcf | shuf | head -n 100 > samples.txt

And to keep 200 random variants for those samples:

bcftools view --header-only --samples-file samples.txt  in.vcf > out.vcf
bcftools view --no-header --samples-file samples.txt  in.vcf | awk '{printf("%f\t%s\n",rand(),$0);}' | sort -t $'\t'  -T . -k1,1g | head -n 200 | cut -f 2- >> out.vcf
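
If sorting every record turns out to be slow on a large file, one alternative (not part of the answer above, just a sketch) is reservoir sampling in awk: it reads the input once and keeps only k lines in memory. The toy `body.txt` below stands in for the output of `bcftools view --no-header ...`; the file names, `k` and `seed` are illustrative assumptions.

```shell
# Toy stand-in for the header-less VCF body: 1000 tab-separated records.
seq 1 1000 | awk '{print "chr1\t" $1 "\t."}' > body.txt

# Reservoir sampling (Algorithm R): one pass, at most k lines held in memory.
# Each kept line is tagged with its input line number so that the original
# order can be restored afterwards.
awk -v k=200 -v seed="$(date +%s)" '
BEGIN   { srand(seed) }                        # seed from the clock
NR <= k { keep[NR] = NR "\t" $0; next }        # fill the reservoir first
        { j = int(rand() * NR) + 1             # uniform integer in 1..NR
          if (j <= k) keep[j] = NR "\t" $0 }   # replace a slot with probability k/NR
END     { n = (NR < k ? NR : k)
          for (i = 1; i <= n; i++) print keep[i] }
' body.txt | sort -n -k1,1 | cut -f 2- > sample.txt
```

With real data, the same awk program would replace the `awk | sort | head | cut` stage of the pipeline above, reading from `bcftools view --no-header --samples-file samples.txt in.vcf` and appending to `out.vcf`. The final `sort -n` only touches the 200 kept lines, so the selected variants come out in their original file order.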

This is very helpful, and exactly what I was looking for. Is there any chance the subsetting of variants could be done with a known tool such as gatk SelectVariants, so as to avoid using awk, sort, etc.? Or would you say that the awk+sort approach is always safe?

They have been around for 50 years. I think they're safe.

Pierre Lindenbaum, thanks again for this excellent answer! I've tested this extensively and it does exactly what I need, though I'm still struggling a bit to understand some parts of it and the rationale behind them. For example, why does awk '{printf("%f\t%s\n",rand(),$0);}' print the same numbers in the first column every time I run it on the same file? I understand shuf is a bad solution for choosing variants because it loads the entire file into memory, but could sort -R or echo $line $RANDOM achieve the same result, or be faster? Currently, downsampling my 30GB+ VCF file to 50 variants takes about 1 hour.

'{printf("%f\t%s\n",rand(),$0);}' prepends a random number to each VCF line.
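
On the repeated numbers: in gawk and several other awk implementations, rand() starts from the same seed on every run unless srand() is called first, so the keys (and therefore the selected variants) are identical from run to run. A minimal sketch of the difference, using a toy three-line file; the seed value 42 and the file names are arbitrary assumptions:

```shell
# Toy input standing in for the VCF body.
printf 'a\nb\nc\n' > lines.txt

# Explicit seed: both runs produce exactly the same random keys.
awk -v seed=42 'BEGIN{srand(seed)} {printf("%f\t%s\n", rand(), $0)}' lines.txt > run1.txt
awk -v seed=42 'BEGIN{srand(seed)} {printf("%f\t%s\n", rand(), $0)}' lines.txt > run2.txt
cmp -s run1.txt run2.txt && echo "same seed, same keys"

# Seed from the clock instead: keys change between runs
# (runs started within the same second may still coincide).
awk 'BEGIN{srand()} {printf("%f\t%s\n", rand(), $0)}' lines.txt > run3.txt
```

Adding `BEGIN{srand()}` to the one-liner above gives a different sample on each run, while an explicit seed keeps the selection reproducible. As for the alternatives: GNU `sort -R` still has to sort the whole input (and groups identical lines together), and a shell loop that echoes `$RANDOM` for each line is typically much slower than awk, so neither is likely to help with the run time.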

Look at tabix to (greatly) speed up such queries.
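
One way this could look, as a sketch under stated assumptions rather than a tested recipe: sample positions first, then let the index fetch only those records, so the expensive sample subsetting is applied to 200 sites instead of the whole file. The toy `positions.txt` below stands in for the real position list; `in.vcf.gz`, `regions.txt` and the other file names are illustrative.

```shell
# Toy stand-in for the real position list, which could be produced with:
#   bcftools query -f '%CHROM\t%POS\n' in.vcf.gz > positions.txt
printf 'chr1\t%s\n' $(seq 100 100 50000) > positions.txt

# Pick 200 positions at random, then order them by chromosome name and position.
# The position list is only two columns wide, so it is far smaller than the VCF.
shuf -n 200 positions.txt | sort -k1,1 -k2,2n > regions.txt

# With real data (assumes in.vcf.gz is bgzip-compressed and indexed,
# e.g. with `tabix -p vcf in.vcf.gz`):
#   bcftools view --regions-file regions.txt --samples-file samples.txt in.vcf.gz > out.vcf
```

Note that sampling by position is not quite the same as sampling by record: if several records share a position (for example split multi-allelic sites), all of them would be returned for that position.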
