Question: filtering a VCF file based on genotype
0
gravatar for Bogdan
18 months ago by
Bogdan880
Palo Alto, CA, USA
Bogdan880 wrote:

Dear all, I would appreciate having a piece of advice :

how shall I filter a VCF file in order to exclude the GERMLINE and TUMOR samples that have a GENOTYPE of "./".

in R/BioC, for example, if we aim to see the GENOTYPES in the vcf file, we receive the following messages :

> geno(vcf)$GT[,"TUMOR"]
Error in geno(vcf)$GT[, "TUMOR"] : subscript out of bounds
> geno(vcf)$GT[,"NORMAL"]
Error in geno(vcf)$GT[, "NORMAL"] : subscript out of bounds

thank you very much !

vcf • 774 views
ADD COMMENTlink modified 18 months ago by RamRS24k • written 18 months ago by Bogdan880
1

What tools are you using? "R/BioC" have multiple tools. Can you please give us the exact commands/scripts you're using?

ADD REPLYlink written 18 months ago by RamRS24k
1

Dear gentlemen, thank you for your replies and help. Here I am posting the R code that I am using to do the filtering of a VCF file that is generated by DELLY and contains INVERSIONs only .. Any suggestions on how to post/link the input VCF file would be welcome. Thank you very much !

library(devtools)
library(stringr)
library(dplyr)
library(ggplot2)

library(VCFWrenchR)
library(VariantAnnotation)
library(VariantTools)

library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(org.Hs.eg.db)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(StructuralVariantAnnotation) 


vcf <- readVcf("vcf.INV3.vcf", 
                genome="hg38")

name <- "vcf.INV3.vcf"

TUMOR  <-  samples(header(vcf))[1] 
NORMAL <-  samples(header(vcf))[2]


DEPTH_threshold <- 8
FRACTION_threshold <- 0.2 



PASS_IMPRECISE_filters = function(x) { DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <-  (DV_tumor > DEPTH_threshold)      
                                       AF <- ((DV_tumor / (DV_tumor + DR_tumor)) > FRACTION_threshold)

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$IMPRECISE) 
                                      }

vcf_PASS_IMPRECISE_FILTERS <- vcf[PASS_IMPRECISE_filters(vcf)]

writeVcf(vcf_PASS_IMPRECISE_FILTERS,
         file=paste(name, ".filter.PASS.IMPRECISE.vcf", sep=""))



PASS_PRECISE_filters = function(x) {   DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <- ((DV_tumor + RV_tumor) > DEPTH_threshold)      
                                       AF <- ((RV_tumor / (RV_tumor + RR_tumor)) > FRACTION_threshold) 

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$PRECISE) 
                                    }


vcf_PASS_PRECISE_FILTERS <- vcf[PASS_PRECISE_filters(vcf)]

writeVcf(vcf_PASS_PRECISE_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.vcf", sep=""))


vcf_PASS_COMBINED_FILTERS <- rbind(vcf_PASS_IMPRECISE_FILTERS, vcf_PASS_PRECISE_FILTERS) 

writeVcf(vcf_PASS_COMBINED_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.and.IMPRECISE.and.AD.and.AF.vcf", sep=""))
ADD REPLYlink written 18 months ago by Bogdan880

Can you give some background on how the VCF was generated?

ADD REPLYlink written 18 months ago by ATpoint25k

Please post the program you are using and the sample vcf file.

ADD REPLYlink written 18 months ago by cpad011212k

I will add to the comments from the 3 other great people. R would not be my 'go to' environment for filtering a VCF. To do what you want, just use bcftools view with the following parameter:

-m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
ADD REPLYlink modified 18 months ago • written 18 months ago by Kevin Blighe51k
0
gravatar for Jorge Amigo
18 months ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

from your question I understand that you have 1 single vcf file with 2 samples (germline and tumor) in it, and that you want to filter out all variants where 1 or the 2 samples have an empty genotype (./.)

if this is the case, a simple grep should be enough:

grep -vP '\t\./\.' file.vcf > file.non-empty-gt.vcf
ADD COMMENTlink modified 18 months ago • written 18 months ago by Jorge Amigo11k

Thank you Jorge. Yes, at the end, I have done it with BCFTOOlS:

bcftools view --genotype ^miss $FILE  > "${FILE%.vcf}.fGT.vcf"

I was hoping to do everything in R, as it would be easier to assemble the entire pipeline in R (than to assemble all the scripts I have in a master bash shell .sh script ) .

ADD REPLYlink modified 18 months ago by RamRS24k • written 18 months ago by Bogdan880
2

Unfortunately, R is not the best for all tasks. You're better off using an automation/pipelining tool that is built for interoperability. People recommend snakemake, but I am not sure how well it would fit your use case, as I'm not conversant with snakemake. For the moment, you can either use system() calls from within R or Rscript/R commands within a shell script.

ADD REPLYlink written 18 months ago by RamRS24k

Thank you Ram for your help. Yes, i've used the "Rscript" calls from a shell script.

ADD REPLYlink written 18 months ago by Bogdan880

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLYlink written 18 months ago by RamRS24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1142 users visited in the last hour