Question: Extract specific genes from VCF file?
gravatar for MAPK
4.3 years ago by
MAPK1.6k wrote:

I have a list of 3000 genes and their start and end locations as shown below. I need to extract the SNPs from VCF file within these locations. What would be the easiest way to do this (or preferably in R)?

   gene_id symbol chromosome start_location end_location
1  "1"     "A1BG" "19"       " 58858171"   " 58864865"
2  "10"    "NAT2" "8"        "  18248754"   "  18258723"
3  "100"   "ADA"  "20"       " 43248162"   "43280376"
4  "1000"  "CDH2" "18"       " 25530926"   " 25616549"
snp extract vcf vcf • 5.9k views
ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by MAPK1.6k

Use vcftools. Check here:

There is an option --bed where you you take columns 3,4,5 above in a simple tab delimited file without any quotes and the header name chrom, chromStart and chromEnd and it will pull out only variants within those bed file regions. So your script will be:

./vcftools --vcf vcf_file.vcf --bed bed_file_name.bed --recode --out new_vcf_file.vcf

Hope this helps.

ADD REPLYlink written 4.2 years ago by das2000sidd30
gravatar for MAPK
4.3 years ago by
MAPK1.6k wrote:

Ok I have finally managed to write this code in R where you can select the genes from biomart or sample it randomly to extract the SNPs from VCF files separated per chromosome. It is not dandy, but does the job and can be customized as needed.

#Extracting some genes and positions from Biomart
#Using biomart

#To use hg19/GrCh37
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="", path="/biomart/martservice" ,dataset="hsapiens_gene_ensembl")
# grch37 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="", path="/biomart/martservice")
# listMarts(grch37)

ensembl<- useDataset("hsapiens_gene_ensembl",mart=ensembl)   # from ensembl using homosapien gene data


# <- getBM(attributes=c("ensembl_gene_id", "external_gene_id"),values=gene_names, mart= ensembl)
# human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# cc <- getBM(attributes = c("hgnc_symbol","chromosome_name", "start_position","end_position"),
#            filters = "hgnc_symbol", values = *,mart = human)

all.genes <- getBM(attributes=c('ensembl_gene_id','gene_biotype','hgnc_symbol','chromosome_name','start_position','end_position'),   mart = ensembl)
# "ensembl_gene_id" "gene_biotype"    "hgnc_symbol"     "chromosome_name" "start_position"  "end_position"  

##getting only protein coding genes
all.genes <- all.genes[(!(all.genes[,"hgnc_symbol"])=="" &all.genes[,"gene_biotype"]=="protein_coding" & grepl("(^[0-9]+$)|^(X|Y)$",all.genes[,"chromosome_name"]) & !duplicated(all.genes[,"hgnc_symbol"])),]

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

all.files <- list.files("/mydir")
all.files <- all.files[grepl("recalibrated.vcf$",all.files)]

##for single file; for loop go below
# bgzip("/mydir/myvcf.chr1.recalibrated.vcf", overwrite=FALSE)
# indexTabix("/mydir/myvcf.chr1.recalibrated.vcf.bgz", format="vcf")

##creating .bgz and .bgz.tbi files
# all.files <- list.files("/mydir", pattern = "*.vcf$",  full.names = TRUE)
#this bit changes the vcf to .gz format
  for (i in all.files) {
    file_gz <- paste0(i, ".gz")
    file_gz_tbi <- paste0(i, ".gz.tbi")

      bgzip(i, paste0(i, ".gz"))

      indexTabix(file_gz, format = "vcf")

#all.genes  #this is the table of all genes in bed format

all.files <-list.files("/mydir")
# all.files <- list.files("/mydir", pattern = "*.gz$",
#                         full.names = TRUE)


##Extracting the genes from VCF

# i<-2
    for(i in 1:length(all.gz.files)){ 
    print(paste0("Doing chromosome:",i))  
#if (save.all.genes[,"chromosome_name"]== sub(".*?chr(.*?)\\.recalibrated.*", "\\1", all.bgz.files[i])){

#This bit here selects only the chromosome number: sub(".*?chr(.*?)\\.recalibrated.*", "\\1", all.gz.files[i])
all.genes <- save.all.genes[save.all.genes[,"chromosome_name"]== sub(".*?chr(.*?)\\.recalibrated.*", "\\1", all.gz.files[i]),]
if(![,"chromosome_name"][1]) & all.genes[,"chromosome_name"][1]==sub(".*?chr(.*?)\\.recalibrated.*", "\\1", all.gz.files[i])){
##file.gz <- "/mydir/myvcf.chr1.recalibrated.vcf.bgz"
file.gz.tbi <- paste(file.gz, ".tbi", sep="")
  indexTabix(file.gz, format="vcf")

# start.loc <- 45959538
# end.loc <- 203047868 <- GRanges("chr1", IRanges(start.loc, end.loc)) <- GRanges(paste0("chr",all.genes$chromosome_name), IRanges( all.genes$start_position, all.genes$end_position))

params <- ScanVcfParam(
vcf <- readVcf(TabixFile(file.gz), "hg19", params)
#locateVariants(vcf[6,], txdb, AllVariants())
##Writing the vcf subset
writeVcf(vcf, paste0("chr.",sub(".*?chr(.*?)\\.recalibrated.*", "\\1", all.gz.files[i]),".test-sub.vcf"))
ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by MAPK1.6k

Have you checked your VCF's chromosomes' names? Do they have a "chr" as prefix?

ADD REPLYlink written 4.3 years ago by michael.ante3.6k

That was one consideration (in addition to the negative position values), but difficult to assess without examples of the files.

ADD REPLYlink written 4.3 years ago by harold.smith.tarheel4.6k

Yes that was one of the issues I had. Thanks.

ADD REPLYlink written 4.3 years ago by MAPK1.6k

How about including a bit of your gene list after awk/perl, and a slice of your VCF, so that we can help you troubleshoot?

ADD REPLYlink written 4.3 years ago by harold.smith.tarheel4.6k
gravatar for michael.ante
4.3 years ago by
michael.ante3.6k wrote:

Hi Mapk,

I would use awk or perl to create a bed file out of your table. After sorting and merging the bed file, you can use intersectBed with your VCF file.



ADD COMMENTlink written 4.3 years ago by michael.ante3.6k

without being entirely clear on what output format the OP is trying to get, I would go even simpler and try a simple grep with a regular expression, something like

grep -E "SOX9|FOXA1|AR" variant_file.txt > output.txt

R has a grep function with roughly the same functionality.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by steve2.6k

I am also interested to extract information for a list of genes from vcf file and tried the above grep command but it did not work and got empty file.

VCF file has been created from bam (exomes) (via samtools mpileup -> bcftools ->

Any other ways please.


ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by bioinfo8120

For that command to work, the gene ID's must be contained in the file; I do not think this is typically the case for a VCF file. You will want to make sure that your VCF is annotated first. Might help to open the VCF in Excel to check on its contents.

ADD REPLYlink written 3.0 years ago by steve2.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour