Question: How to get sample names and genotype for SNP in multi-sample VCF file
0
gravatar for hellbio
10 weeks ago by
hellbio290
hellbio290 wrote:

Dear all,

I have a multi-sample (>100) VCF file and a list of SNPs with CHR and POS for which the sample names and their genotypes need to be extracted.

For example, i have a SNP chr1:23455. I would like to know how many and which samples have homozygous alternate allele, heterozygous allele and homozygous reference allele.

Is there any existing tool that provides the above summary?

Many thanks!!!

snp genotype gatk • 283 views
ADD COMMENTlink modified 10 weeks ago by Kevin Blighe17k • written 10 weeks ago by hellbio290
5
gravatar for Kevin Blighe
10 weeks ago by
Kevin Blighe17k
University College London Cancer Institute
Kevin Blighe17k wrote:

Hope that you're ready for this...

As Ram mentioned, you must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:

bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf

I probably should explain what's going on here, too: It is divided into 4 parts:

  1. tabulates the number of samples having each variant type (as you requested). This will work for phased and/or unphased variants

  2. Looks through the file a second time, saves the sample names into an array called 'header', and then prints the indices of the array where a particular field (i.e. sample) has a particular type of variant. This is repeated for heterozygous, homozygous alt, and homozygouse (ref).

  3. (see 2)

  4. (see 2)

I've tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.

paste <(bcftools view MyVariants.Norm.vcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tnHet\tnHomAlt\tnHomRef"} \
    !/^#/ {hetCount=gsub(/0\|1|1\|0|0\/1|1\/0/,""); \
    homCountAlt=gsub(/1\|1|1\/1/,""); \
    homCountRef=gsub(/0\|0|0\/0/,""); \
    print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"hetCount"\t"homCountAlt"\t"homCountRef}') \
    \
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HetSamples"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|1|1\|0|0\/1|1\/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesAlt"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1\|1|1\/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesRef"} \
    !/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|0|0\/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
    \
| sed 's/,\t/\t/g' | sed 's/,$//g'

CHR POS     ID                  REF ALT nHet nHomAlt nHomRef    HetSamples  HomSamplesAlt   HomSamplesRef
1   10177   1:10177:10177:A:AC  A   AC  4   0   1   HG00096,HG00097,HG00099,HG00100     HG00101
1   10235   1:10235:10235:T:TA  T   TA  0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   10352   1:10352:10352:T:TA  T   TA  5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   10616   1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C  CCGCCGTTGCAAAGGCGCGCCG  C   0   5   0       HG00096,HG00097,HG00099,HG00100,HG00101 
1   10642   1:10642:10642:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11008   1:11008:11008:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11012   1:11012:11012:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   11063   1:11063:11063:T:G   T   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13110   1:13110:13110:G:A   G   A   1   0   4   HG00097     HG00096,HG00099,HG00100,HG00101
1   13116   1:13116:13116:T:G   T   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13118   1:13118:13118:A:G   A   G   2   0   3   HG00097,HG00101     HG00096,HG00099,HG00100
1   13273   1:13273:13273:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13284   1:13284:13284:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13380   1:13380:13380:C:G   C   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13483   1:13483:13483:G:C   G   C   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13494   1:13494:13494:A:G   A   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   13550   1:13550:13550:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   14464   1:14464:14464:A:T   A   T   1   1   3   HG00099 HG00096 HG00097,HG00100,HG00101
1   14599   1:14599:14599:T:A   T   A   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14604   1:14604:14604:A:G   A   G   2   0   3   HG00097,HG00099     HG00096,HG00100,HG00101
1   14930   1:14930:14930:A:G   A   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   14933   1:14933:14933:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15211   1:15211:15211:T:G   T   G   5   0   0   HG00096,HG00097,HG00099,HG00100,HG00101     
1   15245   1:15245:15245:C:T   C   T   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15274   1:15274:15274:A:G   A   G   3   0   2   HG00096,HG00100,HG00101     HG00097,HG00099
1   15274   1:15274:15274:A:T   A   T   3   2   0   HG00096,HG00100,HG00101 HG00097,HG00099 
1   15585   1:15585:15585:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15644   1:15644:15644:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15774   1:15774:15774:G:A   G   A   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
1   15777   1:15777:15777:A:G   A   G   0   0   5           HG00096,HG00097,HG00099,HG00100,HG00101
ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Kevin Blighe17k
2
gravatar for Ram
10 weeks ago by
Ram14k
New York
Ram14k wrote:

I'd use bcftools query to output a file containing %GT, and then process that file using R or python or even awk

ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by Ram14k

Yes, take a look at my previous answer here: A: calculate Per variant Heterozygosity from VCF file

This modified version outputs the extra sites in which you're interested (and works for both phased and unphased, and mixed of these):

bcftools view chr1.1kg.phase3.v5.bcf | awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT\tHetCount\tHomCount (Alt)\tHomCount (Ref)"} !/^#/ {hetCount=gsub(/0\|1|1\|0|0\/1|1\/0/,""); homCountAlt=gsub(/1\|1|1\/1/,""); homCountRef=gsub(/0\|0|0\/0/,""); print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"hetCount"\t"homCountAlt"\t"homCountRef}'
CHR POS ID  REF ALT HetCount    HomCount (Alt)  HomCount (Ref)
1   10177   1:10177:10177:A:AC  A   AC  1490    320 694
1   10235   1:10235:10235:T:TA  T   TA  6   0   2498
1   10352   1:10352:10352:T:TA  T   TA  2025    83  396
1   10616   1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C  CCGCCGTTGCAAAGGCGCGCCG  C   35  2469    0
1   10642   1:10642:10642:G:A   G   A   21  0   2483
1   11008   1:11008:11008:C:G   C   G   403 19  2082
1   11012   1:11012:11012:C:G   C   G   403 19  2082
1   11063   1:11063:11063:T:G   T   G   15  0   2489
1   13110   1:13110:13110:G:A   G   A   134 0   2370
1   13116   1:13116:13116:T:G   T   G   414 36  2054
1   13118   1:13118:13118:A:G   A   G   414 36  2054
1   13273   1:13273:13273:G:C   G   C   444 16  2044
1   13284   1:13284:13284:G:A   G   A   7   0   2497
1   13380   1:13380:13380:C:G   C   G   41  0   2463
1   13483   1:13483:13483:G:C   G   C   10  0   2494
1   13494   1:13494:13494:A:G   A   G   7   0   2497
1   13550   1:13550:13550:G:A   G   A   17  0   2487
1   14464   1:14464:14464:A:T   A   T   428 26  2050
1   14599   1:14599:14599:T:A   T   A   711 14  1779
ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Kevin Blighe17k

So,bcftools norm -m - -Ou $VCF | your_one_liner?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by Ram14k

Something along htose lines - yes! It is critical to normalise the VCF first.

Also, this does not output sample names that have each genotype. That would require extra coding

ADD REPLYlink written 10 weeks ago by Kevin Blighe17k

Sample names with each GT would be bcftools query -f "... %GT"

ADD REPLYlink written 10 weeks ago by Ram14k

do you mean that it should be done in steps? to get counts and sample names?

ADD REPLYlink written 10 weeks ago by hellbio290

From what i can see, it would be better to break it up into different steps and then merge the output. I know how to do it but don't have time right now. I may try later today. Of course, somebody else may have a solution by then.

ADD REPLYlink written 10 weeks ago by Kevin Blighe17k

Yep, you'll need at least 2 - a "map" step to get genotypes for each individual at each site and a "reduce" step, to get counts or a CSV-list of per-genotype individuals.

ADD REPLYlink written 10 weeks ago by Ram14k

I've posted an answer below, in part because you'll appreciate the greater room in viewing the code. Thanks to Ram for input (upvote)

ADD REPLYlink written 10 weeks ago by Kevin Blighe17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 957 users visited in the last hour