Different results between two methods for the extraction of SNPs, common for a subgroup from a multiVCF file
1
1
Entering edit mode
4.7 years ago
p.m ▴ 20

Dear all,

I try to get an subset of SNPs common for a three of six individuals from a multiVCF file. My problem is that I get different results when I do it

  1. manually (extract genotypes using "snpsift extractFields" and filter the variants by the excel)
  2. when I use "snpsift filter "

    SnpSift varType snps_results_dir/my_multiVCF_SNPs.vcf | SnpSift filter "isVariant( GEN[1]) & isVariant( GEN[2]) & isVariant( GEN[3]) & isRef( GEN[4]) & isRef( GEN[5]) & isRef( GEN[6]) & isRef( GEN[7]) & isRef( GEN[8])" > my_SNP_subset.vcf

With the first method I get 479 SNPs, however "snpsift filter" (second method) gives me about 250 SNPs.

So I'm confused. What is the right method/result? Could somebody help me with this question/discrepance? Are there any other standard procedure to filter the variants from the multiVCF file?

Thank you very much in advance

Kind regards

Pavlo

SNP SnpSift Filter MultiVCF NGS • 1.1k views
ADD COMMENT
0
Entering edit mode

Excel filters are not reproducible (and are susceptible to data import as well as manual errors) , so we cannot really help you with why you're seeing different results.

If you could use R instead of Excel to apply these filters, the script would help us determine what is behind the discrepancy.

ADD REPLY
0
Entering edit mode

Thank you very much for the tip. I'll try it now.

ADD REPLY
1
Entering edit mode
4.7 years ago
p.m ▴ 20

Dear RamRS,

I figured out my problem. R produce the same results as excel. I had a mistake in the SnpSift filter. So now I have in all three variants (excel, R and SnpSift filter) the same output.

What I have done in R:

  1. I exported genotype data from my multiVCF file using following command:

    >snpsift extractFields -s "," -e "." - CHROM POS  "VARTYPE[*]" "GEN[1].GT" "GEN[2].GT" "GEN[3].GT" "GEN[4].GT" "GEN[5].GT" "GEN[6].GT" "GEN[7].GT" "GEN[8].GT" "GEN[9].GT" "GEN[10].GT" >  All_GT.txt
    
  2. Imported and filtered the data set into R:

    >library(tidyverse)
    
    >gt_dat = read.csv2("All_GT.csv", header=TRUE)
    
    >SNPs=gt_dat[grep("SNP",gt_dat$VARTYPE...),]
    
    >gt_dat_tibble = as_tibble(SNPs)
    >SNP_subset = gt_dat_tibble %>% filter(GEN.1..GT >=1,
    +                              GEN.5..GT >= 1,
    +                              GEN.6..GT >=1,
    +                              GEN.7..GT == 0,
    +                              GEN.8..GT == 0,
    +                              GEN.9..GT == 0,
    +                              GEN.10..GT == 0)
    
    > SNP_subset
    >A tibble: 479 x 13
    

Best

Pavlo

ADD COMMENT

Login before adding your answer.

Traffic: 1801 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6