I am attempting to merge the exome and genome VCFs from gnomAD. I have successfully paired down to the gene I am interested in, subset to the INFO fields I need, and now I am trying to merge.
I would like to sum all of the "AC", "AN" and "GC" INFO fields, including all of the population-specifc fields. Currently I am using the following command:
bcftools merge -m all -i AC:sum,AN:sum,AC_AFR:sum,AC_AMR:sum,AC_ASJ:sum,AC_EAS:sum,AC_FIN:sum,AC_NFE:sum,AC_OTH:sum,AC_SAS:sum,AN_AFR:sum,AN_AMR:sum,AN_ASJ:sum,AN_EAS:sum,AN_FIN:sum,AN_NFE:sum,AN_OTH:sum,AN_SAS:sum,GC_AFR:sum,GC_AMR:sum,GC_ASJ:sum,GC_EAS:sum,GC_FIN:sum,GC_NFE:sum,GC_OTH:sum,GC_SAS:sum,AC_raw:sum,AN_raw:sum,GC_raw:sum,GC:sum in1.vcf.bgz in2.vcf.bgz -o out.vcf
I can confirm that both input VCFs have the "AC" and "AN" fields, but after the merge, "AC" and "AN" are missing. All of the population-specific fields, eg. AC_AFR, are present. "GC" is also present. Why is the merge dropping "AC" and "AN"?
I am using: bcftools 1.9 Using htslib 1.9
The complete script:
library(Rsamtools)
rd <- file.path("inst", "raw")
dd <- file.path("data")
if (!dir.exists(dirname(rd))) dir.create(dirname(rd))
if (!dir.exists(rd)) dir.create(rd)
if (!dir.exists(dd)) dir.create(dd)
l <- "gs://gnomad-public/release/2.0.2/vcf/"
e1 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz")
e2 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi")
g1 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz")
g2 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz.tbi")
getFile <- function(f, l) {
if (file.exists(f)) return(TRUE)
sub <- if (grepl("genomes", f)) "genomes" else "exomes"
s <- sprintf("gsutil -m cp %s%s/%s %s", l, sub, basename(f), f)
system(s)
TRUE
}
getFile(e1, l)
getFile(e2, l)
getFile(g1, l)
getFile(g2, l)
# tpo GRCh37 coordinates (https://www.ncbi.nlm.nih.gov/gene/7173)
tpo <- "2:1417233-1547445"
## Subset down to allele and genotype count fields
infoFlds <- rownames(scanBcfHeader(e1)[[1]]$Header$INFO)
keep <- infoFlds[grep("^AC|^AN|^GC", infoFlds)]
keep <- keep[!grepl("POPMAX|Male|Female", keep)]
keepStr <- paste(paste0("INFO/", c(keep, "AS_FilterStatus")), collapse = ",")
e3 <- sub("r2.0.2.sites", "tpo", e1)
g3 <- sub("r2.0.2.sites.chr2", "tpo", g1)
fmt1 <- "bcftools annotate -r %s -x ^%s %s | bgzip -c > %s"
system(sprintf(fmt1, tpo, keepStr, e1, e3))
system(sprintf(fmt1, tpo, keepStr, g1, g3))
system(sprintf("bcftools index -t %s", e3))
system(sprintf("bcftools index -t %s", g3))
mrg <- paste(paste0(keep, ":sum"), collapse = ",")
c1 <- file.path(rd, "gnomad.tpo.vcf")
system(sprintf("bcftools merge -m all -i %s %s %s -o %s", mrg, e3, g3, c1))
The following loads the VCFs, and displays the INFO fields contained in the files. The genome data does not contain any South Asian samples.
eDat <- scanBcf(e3)
gDat <- scanBcf(g3)
cDat <- scanBcf(c1)
getInfoFlds <- function(dat) {
sapply(strsplit(strsplit(dat$INFO[[1]], ";")[[1]], "="), "[[", 1)
}
getInfoFlds(eDat)
# [1] "AC" "AN" "AC_AFR" "AC_AMR"
# [5] "AC_ASJ" "AC_EAS" "AC_FIN" "AC_NFE"
# [9] "AC_OTH" "AC_SAS" "AN_AFR" "AN_AMR"
# [13] "AN_ASJ" "AN_EAS" "AN_FIN" "AN_NFE"
# [17] "AN_OTH" "AN_SAS" "GC_AFR" "GC_AMR"
# [21] "GC_ASJ" "GC_EAS" "GC_FIN" "GC_NFE"
# [25] "GC_OTH" "GC_SAS" "AC_raw" "AN_raw"
# [29] "GC_raw" "GC" "AS_FilterStatus"
getInfoFlds(gDat)
# [1] "AC" "AN" "GC_raw" "AC_raw"
# [5] "AN_raw" "GC" "AC_AFR" "AC_AMR"
# [9] "AC_ASJ" "AC_EAS" "AC_FIN" "AC_NFE"
# [13] "AC_OTH" "AN_AFR" "AN_AMR" "AN_ASJ"
# [17] "AN_EAS" "AN_FIN" "AN_NFE" "AN_OTH"
# [21] "AS_FilterStatus" "GC_AFR" "GC_AMR" "GC_ASJ"
# [25] "GC_EAS" "GC_FIN" "GC_NFE" "GC_OTH"
getInfoFlds(cDat)
# [1] "AC_AFR" "AC_AMR" "AC_ASJ" "AC_EAS"
# [5] "AC_FIN" "AC_NFE" "AC_OTH" "AC_raw"
# [9] "AN_AFR" "AN_AMR" "AN_ASJ" "AN_EAS"
# [13] "AN_FIN" "AN_NFE" "AN_OTH" "AN_raw"
# [17] "GC" "GC_AFR" "GC_AMR" "GC_ASJ"
# [21] "GC_EAS" "GC_FIN" "GC_NFE" "GC_OTH"
# [25] "GC_raw" "AS_FilterStatus"
We can also show that each entry in the input VCFs have the AC field:
system(sprintf("bcftools view %s | grep AC= | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep -v '#' | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep AC= | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep -v '#' | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep AC= | wc -l", c1))
# 0
The South Asian (*_SAS) fields are also being lost, which is problematic:
setdiff(getInfoFlds(eDat), getInfoFlds(gDat))
# [1] "AC_SAS" "AN_SAS" "GC_SAS"
setdiff(getInfoFlds(eDat), getInfoFlds(cDat))
# [1] "AC" "AN" "AC_SAS" "AN_SAS" "GC_SAS"
It makes more sense they are lost, because they are not present in the genome set. However, I still need to pull that information along.
What happens if you omit the
AC:sum,AN:sumpart? Try it on the first 10 VCF entries only, so you can eyeball the results' veracity.They are still omitted if I remove
AC:sum,AN:sum. From the documentation it seems like only DP and DP4 have default behavior. It is odd that they are completely dropped though, rather than taking the value from the first file.Can you edit your question and add the version of
bcftoolsandhtslibyou're using please? Just paste the output ofbcftools -vDone. Happy to upload the entire script, but it requires downloading the (somewhat) large gnomAD VCFs.
No, don't worry about it. I've worked a LOT on bcftools + gnomAD VCF files, so I am familiar with the setting.
You mention gnomAD genome and exome datasets, so I wonder if these are different samples (appropriate for
bcftools merge) or the same set of samples (need a different tool).Each file, in my understanding, has one "sample" -- which I am trying to merge together to get allele frequencies combining the genome and exome data (similar to what gnomAD displays for the browser). For example, here you can toggle including genome and exome data, which just sums over the allele count and allele number.
And the files
e3andg3have theACandANfields? Just so we're suremergeis the one removing them.Yes. I added some clarifying output to the post.
I'm kinda busy so I'm unable to pay this as much attention as I'd like to. Wild suggestion: Try running bcftools from the command line, and quote all input arguments, especially the one passed to
-i.No luck. I appreciate your attempts to help. I also tried manually adding the
SASfields to the genome VCF before the merge, with no success. I thought it might exclude the "overall" values when there were subset values missing.Maybe give CombineVariants a shot?
Removing the
_SASfields did not fix the problem. I thought maybe theANandACwere removed because theSASfields didn't carry over from the exome file.