Question: filtering a VCF file based on genotype
0
gravatar for Bogdan
2.1 years ago by
Bogdan1000
Palo Alto, CA, USA
Bogdan1000 wrote:

Dear all, I would appreciate having a piece of advice :

how shall I filter a VCF file in order to exclude the GERMLINE and TUMOR samples that have a GENOTYPE of "./".

in R/BioC, for example, if we aim to see the GENOTYPES in the vcf file, we receive the following messages :

> geno(vcf)$GT[,"TUMOR"]
Error in geno(vcf)$GT[, "TUMOR"] : subscript out of bounds
> geno(vcf)$GT[,"NORMAL"]
Error in geno(vcf)$GT[, "NORMAL"] : subscript out of bounds

thank you very much !

vcf • 995 views
ADD COMMENTlink modified 2.1 years ago by RamRS27k • written 2.1 years ago by Bogdan1000
1

What tools are you using? "R/BioC" have multiple tools. Can you please give us the exact commands/scripts you're using?

ADD REPLYlink written 2.1 years ago by RamRS27k
1

Dear gentlemen, thank you for your replies and help. Here I am posting the R code that I am using to do the filtering of a VCF file that is generated by DELLY and contains INVERSIONs only .. Any suggestions on how to post/link the input VCF file would be welcome. Thank you very much !

library(devtools)
library(stringr)
library(dplyr)
library(ggplot2)

library(VCFWrenchR)
library(VariantAnnotation)
library(VariantTools)

library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(org.Hs.eg.db)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(StructuralVariantAnnotation) 


vcf <- readVcf("vcf.INV3.vcf", 
                genome="hg38")

name <- "vcf.INV3.vcf"

TUMOR  <-  samples(header(vcf))[1] 
NORMAL <-  samples(header(vcf))[2]


DEPTH_threshold <- 8
FRACTION_threshold <- 0.2 



PASS_IMPRECISE_filters = function(x) { DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <-  (DV_tumor > DEPTH_threshold)      
                                       AF <- ((DV_tumor / (DV_tumor + DR_tumor)) > FRACTION_threshold)

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$IMPRECISE) 
                                      }

vcf_PASS_IMPRECISE_FILTERS <- vcf[PASS_IMPRECISE_filters(vcf)]

writeVcf(vcf_PASS_IMPRECISE_FILTERS,
         file=paste(name, ".filter.PASS.IMPRECISE.vcf", sep=""))



PASS_PRECISE_filters = function(x) {   DV_germline <- ( geno(vcf)$DV[,NORMAL] < 1 )
                                       RV_germline <- ( geno(vcf)$RV[,NORMAL] < 1 )

                                       DV_tumor <- geno(vcf)$DV[,TUMOR]
                                       RV_tumor <- geno(vcf)$RV[,TUMOR]  

                                       DR_tumor <- geno(vcf)$DR[,TUMOR]
                                       RR_tumor <- geno(vcf)$RR[,TUMOR] 

                                       AD <- ((DV_tumor + RV_tumor) > DEPTH_threshold)      
                                       AF <- ((RV_tumor / (RV_tumor + RR_tumor)) > FRACTION_threshold) 

                                       DV_germline &
                                       RV_germline & 
                                       AD &
                                       AF & 
                                       (filt(vcf) == "PASS") & 
                                       (info(vcf)$PRECISE) 
                                    }


vcf_PASS_PRECISE_FILTERS <- vcf[PASS_PRECISE_filters(vcf)]

writeVcf(vcf_PASS_PRECISE_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.vcf", sep=""))


vcf_PASS_COMBINED_FILTERS <- rbind(vcf_PASS_IMPRECISE_FILTERS, vcf_PASS_PRECISE_FILTERS) 

writeVcf(vcf_PASS_COMBINED_FILTERS, 
         file=paste(name, ".filter.PASS.PRECISE.and.IMPRECISE.and.AD.and.AF.vcf", sep=""))
ADD REPLYlink written 2.1 years ago by Bogdan1000

Can you give some background on how the VCF was generated?

ADD REPLYlink written 2.1 years ago by ATpoint35k

Please post the program you are using and the sample vcf file.

ADD REPLYlink written 2.1 years ago by cpad011213k

I will add to the comments from the 3 other great people. R would not be my 'go to' environment for filtering a VCF. To do what you want, just use bcftools view with the following parameter:

-m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Kevin Blighe60k
0
gravatar for Jorge Amigo
2.1 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

from your question I understand that you have 1 single vcf file with 2 samples (germline and tumor) in it, and that you want to filter out all variants where 1 or the 2 samples have an empty genotype (./.)

if this is the case, a simple grep should be enough:

grep -vP '\t\./\.' file.vcf > file.non-empty-gt.vcf
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Jorge Amigo11k

Thank you Jorge. Yes, at the end, I have done it with BCFTOOlS:

bcftools view --genotype ^miss $FILE  > "${FILE%.vcf}.fGT.vcf"

I was hoping to do everything in R, as it would be easier to assemble the entire pipeline in R (than to assemble all the scripts I have in a master bash shell .sh script ) .

ADD REPLYlink modified 2.1 years ago by RamRS27k • written 2.1 years ago by Bogdan1000
2

Unfortunately, R is not the best for all tasks. You're better off using an automation/pipelining tool that is built for interoperability. People recommend snakemake, but I am not sure how well it would fit your use case, as I'm not conversant with snakemake. For the moment, you can either use system() calls from within R or Rscript/R commands within a shell script.

ADD REPLYlink written 2.1 years ago by RamRS27k

Thank you Ram for your help. Yes, i've used the "Rscript" calls from a shell script.

ADD REPLYlink written 2.1 years ago by Bogdan1000

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLYlink written 2.1 years ago by RamRS27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 891 users visited in the last hour