Question: extracting genotypes from a multi-sample VCF that have certain variants
Floydian_slip (130), United States, wrote 4.6 years ago:

Hi, I have a set of variants and a multi-sample merged VCF that gives the genotype of each sample. Is there a way to extract the names of the samples that have those variants? Ideally, I would like to do this per variant: each variant followed by the names of the samples that have it.

Thanks a lot in advance! ~N

Tags: genotypes, vcf
written 4.6 years ago by Floydian_slip (130)

It's not clear to me where you're looking for the genotype (sample, A1, A2) and the variant (chrom/pos/ref/alts). What are your inputs?

modified 4.6 years ago • written 4.6 years ago by Pierre Lindenbaum (130k)

I have 2 inputs: 1. A VCF file with a set of variants. 2. Another VCF file, merged from multiple individuals, that gives the genotype of each individual at each variant (present, absent, etc.).

Not every individual will have the variants from the first file. What I would like to know is which samples have each of the variants from the first file, e.g., variant1 from file1 is present in these samples from file2.
I hope that is clear.

written 4.6 years ago by Floydian_slip (130)

So, I figured out a way. First, I can use bedtools intersect on the two files to keep only those lines of the multi-sample merged VCF that contain the variants I want information for. Next, from the resulting file, I can parse the columns holding each sample's genotype and, using awk, cut, etc., extract only the column headings (and hence the sample names) whose genotype carries the variant (e.g. 0/1, 1/1 or 1/2, meaning the sample has the variant in some form).
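
For anyone who prefers a single script to the bedtools + awk route, here is a rough Python sketch of the same two steps. It is not the exact pipeline described above: instead of a positional intersect it matches records on (CHROM, POS, REF, ALT) as plain strings, and it assumes uncompressed, tab-delimited VCF text with GT as the first FORMAT subfield. All function names and the toy records are made up for illustration.

```python
import io


def variant_key(fields):
    """Identify a VCF record by (CHROM, POS, REF, ALT), compared as strings."""
    return (fields[0], fields[1], fields[3], fields[4])


def load_variant_keys(handle):
    """Collect the keys of every record in the variants-of-interest VCF."""
    keys = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the header line and blank lines
        keys.add(variant_key(line.rstrip("\n").split("\t")))
    return keys


def has_alt_allele(sample_field):
    """True if the genotype has any non-reference allele (0/1, 1/1, 1|2, ...)."""
    gt = sample_field.split(":")[0]  # assumption: GT is the first subfield
    alleles = gt.replace("|", "/").split("/")
    return any(a.isdigit() and a != "0" for a in alleles)


def carriers_per_variant(handle, wanted):
    """Yield (key, [sample names]) for each merged-VCF record found in `wanted`."""
    samples = []
    for line in handle:
        if line.startswith("##") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at the 10th column
            continue
        key = variant_key(fields)
        if key in wanted:
            yield key, [s for s, f in zip(samples, fields[9:]) if has_alt_allele(f)]


def make_vcf(rows):
    """Join rows of fields into tab-delimited VCF text (toy data helper)."""
    return "".join("\t".join(row) + "\n" for row in rows)


# Toy inputs, invented for this example only.
VARIANTS_VCF = make_vcf([
    ["##fileformat=VCFv4.2"],
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"],
    ["chr1", "100", ".", "A", "G", ".", ".", "."],
    ["chr2", "500", ".", "C", "T", ".", ".", "."],
])
MERGED_VCF = make_vcf([
    ["##fileformat=VCFv4.2"],
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "S1", "S2", "S3"],
    ["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "0/0", "1/1"],
    ["chr1", "200", ".", "T", "C", ".", "PASS", ".", "GT", "1/1", "0/1", "0/0"],
    ["chr2", "500", ".", "C", "T", ".", "PASS", ".", "GT:DP", "0/0:12", "./.:0", "0|1:9"],
])

wanted = load_variant_keys(io.StringIO(VARIANTS_VCF))
result = list(carriers_per_variant(io.StringIO(MERGED_VCF), wanted))
for (chrom, pos, ref, alt), carriers in result:
    print(f"{chrom}:{pos} {ref}>{alt}\t{','.join(carriers)}")
```

On the toy data this reports S1 and S3 for chr1:100 A>G and S3 for chr2:500 C>T. Two caveats: at multi-allelic sites the ALT string (e.g. "G,T") must match exactly between the two files, and any non-reference allele counts as carrying the variant, so a 1/2 genotype is reported without checking which ALT allele is the one of interest. Splitting multi-allelic records beforehand would avoid both issues.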


written 4.6 years ago by Floydian_slip (130)