Question

Losing info fields using bcftools merge

0

Entering edit mode

7.3 years ago

dayne.filer ▴ 10

I am attempting to merge the exome and genome VCFs from gnomAD. I have successfully paired down to the gene I am interested in, subset to the INFO fields I need, and now I am trying to merge.

I would like to sum all of the "AC", "AN" and "GC" INFO fields, including all of the population-specifc fields. Currently I am using the following command:

bcftools merge -m all -i AC:sum,AN:sum,AC_AFR:sum,AC_AMR:sum,AC_ASJ:sum,AC_EAS:sum,AC_FIN:sum,AC_NFE:sum,AC_OTH:sum,AC_SAS:sum,AN_AFR:sum,AN_AMR:sum,AN_ASJ:sum,AN_EAS:sum,AN_FIN:sum,AN_NFE:sum,AN_OTH:sum,AN_SAS:sum,GC_AFR:sum,GC_AMR:sum,GC_ASJ:sum,GC_EAS:sum,GC_FIN:sum,GC_NFE:sum,GC_OTH:sum,GC_SAS:sum,AC_raw:sum,AN_raw:sum,GC_raw:sum,GC:sum in1.vcf.bgz in2.vcf.bgz -o out.vcf

I can confirm that both input VCFs have the "AC" and "AN" fields, but after the merge, "AC" and "AN" are missing. All of the population-specific fields, eg. AC_AFR, are present. "GC" is also present. Why is the merge dropping "AC" and "AN"?

I am using: bcftools 1.9 Using htslib 1.9

The complete script:

library(Rsamtools)
rd <- file.path("inst", "raw")
dd <- file.path("data")
if (!dir.exists(dirname(rd))) dir.create(dirname(rd))
if (!dir.exists(rd)) dir.create(rd)
if (!dir.exists(dd)) dir.create(dd)

l <- "gs://gnomad-public/release/2.0.2/vcf/"
e1 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz")    
e2 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi")
g1 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz")
g2 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz.tbi")
getFile <- function(f, l) {
  if (file.exists(f)) return(TRUE)
  sub <- if (grepl("genomes", f)) "genomes" else "exomes"
  s <- sprintf("gsutil -m cp %s%s/%s %s", l, sub, basename(f), f)
  system(s)
  TRUE
}
getFile(e1, l)
getFile(e2, l)
getFile(g1, l)
getFile(g2, l)

# tpo GRCh37 coordinates (https://www.ncbi.nlm.nih.gov/gene/7173)
tpo <- "2:1417233-1547445"

## Subset down to allele and genotype count fields
infoFlds <- rownames(scanBcfHeader(e1)[[1]]$Header$INFO)
keep <- infoFlds[grep("^AC|^AN|^GC", infoFlds)]
keep <- keep[!grepl("POPMAX|Male|Female", keep)]
keepStr <- paste(paste0("INFO/", c(keep, "AS_FilterStatus")), collapse = ",")

e3 <- sub("r2.0.2.sites", "tpo", e1)
g3 <- sub("r2.0.2.sites.chr2", "tpo", g1)
fmt1 <- "bcftools annotate -r %s -x ^%s %s | bgzip -c > %s"
system(sprintf(fmt1, tpo, keepStr, e1, e3))
system(sprintf(fmt1, tpo, keepStr, g1, g3))
system(sprintf("bcftools index -t %s", e3))
system(sprintf("bcftools index -t %s", g3))

mrg <- paste(paste0(keep, ":sum"), collapse = ",")
c1 <- file.path(rd, "gnomad.tpo.vcf")
system(sprintf("bcftools merge -m all -i %s %s %s -o %s", mrg, e3, g3, c1))

The following loads the VCFs, and displays the INFO fields contained in the files. The genome data does not contain any South Asian samples.

eDat <- scanBcf(e3)
gDat <- scanBcf(g3)
cDat <- scanBcf(c1)

getInfoFlds <- function(dat) {
  sapply(strsplit(strsplit(dat$INFO[[1]], ";")[[1]], "="), "[[", 1)
}

getInfoFlds(eDat)
#  [1] "AC"              "AN"              "AC_AFR"          "AC_AMR"         
#  [5] "AC_ASJ"          "AC_EAS"          "AC_FIN"          "AC_NFE"         
#  [9] "AC_OTH"          "AC_SAS"          "AN_AFR"          "AN_AMR"         
# [13] "AN_ASJ"          "AN_EAS"          "AN_FIN"          "AN_NFE"         
# [17] "AN_OTH"          "AN_SAS"          "GC_AFR"          "GC_AMR"         
# [21] "GC_ASJ"          "GC_EAS"          "GC_FIN"          "GC_NFE"         
# [25] "GC_OTH"          "GC_SAS"          "AC_raw"          "AN_raw"         
# [29] "GC_raw"          "GC"              "AS_FilterStatus"
getInfoFlds(gDat)
#  [1] "AC"              "AN"              "GC_raw"          "AC_raw"         
#  [5] "AN_raw"          "GC"              "AC_AFR"          "AC_AMR"         
#  [9] "AC_ASJ"          "AC_EAS"          "AC_FIN"          "AC_NFE"         
# [13] "AC_OTH"          "AN_AFR"          "AN_AMR"          "AN_ASJ"         
# [17] "AN_EAS"          "AN_FIN"          "AN_NFE"          "AN_OTH"         
# [21] "AS_FilterStatus" "GC_AFR"          "GC_AMR"          "GC_ASJ"         
# [25] "GC_EAS"          "GC_FIN"          "GC_NFE"          "GC_OTH"         
getInfoFlds(cDat)
#  [1] "AC_AFR"          "AC_AMR"          "AC_ASJ"          "AC_EAS"         
#  [5] "AC_FIN"          "AC_NFE"          "AC_OTH"          "AC_raw"         
#  [9] "AN_AFR"          "AN_AMR"          "AN_ASJ"          "AN_EAS"         
# [13] "AN_FIN"          "AN_NFE"          "AN_OTH"          "AN_raw"         
# [17] "GC"              "GC_AFR"          "GC_AMR"          "GC_ASJ"         
# [21] "GC_EAS"          "GC_FIN"          "GC_NFE"          "GC_OTH"         
# [25] "GC_raw"          "AS_FilterStatus"

We can also show that each entry in the input VCFs have the AC field:

system(sprintf("bcftools view %s | grep AC= | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep -v '#' | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep AC= | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep -v '#' | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep AC= | wc -l", c1))
# 0

The South Asian (*_SAS) fields are also being lost, which is problematic:

setdiff(getInfoFlds(eDat), getInfoFlds(gDat))
# [1] "AC_SAS" "AN_SAS" "GC_SAS"
setdiff(getInfoFlds(eDat), getInfoFlds(cDat))
# [1] "AC"     "AN"     "AC_SAS" "AN_SAS" "GC_SAS"

It makes more sense they are lost, because they are not present in the genome set. However, I still need to pull that information along.

bcftools merge vcf gnomAD • 3.8k views

ADD COMMENT • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

What happens if you omit the AC:sum,AN:sum part? Try it on the first 10 VCF entries only, so you can eyeball the results' veracity.

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

They are still omitted if I remove AC:sum,AN:sum. From the documentation it seems like only DP and DP4 have default behavior. It is odd that they are completely dropped though, rather than taking the value from the first file.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

Can you edit your question and add the version of bcftools and htslib you're using please? Just paste the output of bcftools -v

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

Done. Happy to upload the entire script, but it requires downloading the (somewhat) large gnomAD VCFs.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

No, don't worry about it. I've worked a LOT on bcftools + gnomAD VCF files, so I am familiar with the setting.

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

You mention gnomAD genome and exome datasets, so I wonder if these are different samples (appropriate for bcftools merge) or the same set of samples (need a different tool).

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

Each file, in my understanding, has one "sample" -- which I am trying to merge together to get allele frequencies combining the genome and exome data (similar to what gnomAD displays for the browser). For example, here you can toggle including genome and exome data, which just sums over the allele count and allele number.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

And the files e3 and g3 have the AC and AN fields? Just so we're sure merge is the one removing them.

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

Yes. I added some clarifying output to the post.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

I'm kinda busy so I'm unable to pay this as much attention as I'd like to. Wild suggestion: Try running bcftools from the command line, and quote all input arguments, especially the one passed to -i.

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

No luck. I appreciate your attempts to help. I also tried manually adding the SAS fields to the genome VCF before the merge, with no success. I thought it might exclude the "overall" values when there were subset values missing.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10

0

Entering edit mode

Maybe give CombineVariants a shot?

ADD REPLY • link 7.3 years ago by Ram 45k

0

Entering edit mode

Removing the _SAS fields did not fix the problem. I thought maybe the AN and AC were removed because the SAS fields didn't carry over from the exome file.

ADD REPLY • link 7.3 years ago by dayne.filer ▴ 10