Question: Different results between two methods for the extraction of SNPs, common for a subgroup from a multiVCF file
1
gravatar for pavlo.maksimov
27 days ago by
pavlo.maksimov20 wrote:

Dear all,

I try to get an subset of SNPs common for a three of six individuals from a multiVCF file. My problem is that I get different results when I do it

  1. manually (extract genotypes using "snpsift extractFields" and filter the variants by the excel)
  2. when I use "snpsift filter "

    SnpSift varType snps_results_dir/my_multiVCF_SNPs.vcf | SnpSift filter "isVariant( GEN[1]) & isVariant( GEN[2]) & isVariant( GEN[3]) & isRef( GEN[4]) & isRef( GEN[5]) & isRef( GEN[6]) & isRef( GEN[7]) & isRef( GEN[8])" > my_SNP_subset.vcf

With the first method I get 479 SNPs, however "snpsift filter" (second method) gives me about 250 SNPs.

So I'm confused. What is the right method/result? Could somebody help me with this question/discrepance? Are there any other standard procedure to filter the variants from the multiVCF file?

Thank you very much in advance

Kind regards

Pavlo

multivcf snp ngs filter snpsift • 149 views
ADD COMMENTlink modified 26 days ago • written 27 days ago by pavlo.maksimov20

Excel filters are not reproducible (and are susceptible to data import as well as manual errors) , so we cannot really help you with why you're seeing different results.

If you could use R instead of Excel to apply these filters, the script would help us determine what is behind the discrepancy.

ADD REPLYlink modified 27 days ago • written 27 days ago by RamRS24k

Thank you very much for the tip. I'll try it now.

ADD REPLYlink written 27 days ago by pavlo.maksimov20
1
gravatar for pavlo.maksimov
26 days ago by
pavlo.maksimov20 wrote:

Dear RamRS,

I figured out my problem. R produce the same results as excel. I had a mistake in the SnpSift filter. So now I have in all three variants (excel, R and SnpSift filter) the same output.

What I have done in R:

  1. I exported genotype data from my multiVCF file using following command:

    >snpsift extractFields -s "," -e "." - CHROM POS  "VARTYPE[*]" "GEN[1].GT" "GEN[2].GT" "GEN[3].GT" "GEN[4].GT" "GEN[5].GT" "GEN[6].GT" "GEN[7].GT" "GEN[8].GT" "GEN[9].GT" "GEN[10].GT" >  All_GT.txt
    
  2. Imported and filtered the data set into R:

    >library(tidyverse)
    
    >gt_dat = read.csv2("All_GT.csv", header=TRUE)
    
    >SNPs=gt_dat[grep("SNP",gt_dat$VARTYPE...),]
    
    >gt_dat_tibble = as_tibble(SNPs)
    >SNP_subset = gt_dat_tibble %>% filter(GEN.1..GT >=1,
    +                              GEN.5..GT >= 1,
    +                              GEN.6..GT >=1,
    +                              GEN.7..GT == 0,
    +                              GEN.8..GT == 0,
    +                              GEN.9..GT == 0,
    +                              GEN.10..GT == 0)
    
    > SNP_subset
    >A tibble: 479 x 13
    

Best

Pavlo

ADD COMMENTlink modified 26 days ago by RamRS24k • written 26 days ago by pavlo.maksimov20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 691 users visited in the last hour