Question: Losing info fields using bcftools merge
0
gravatar for dayne.filer
23 months ago by
dayne.filer10
dayne.filer10 wrote:

I am attempting to merge the exome and genome VCFs from gnomAD. I have successfully paired down to the gene I am interested in, subset to the INFO fields I need, and now I am trying to merge.

I would like to sum all of the "AC", "AN" and "GC" INFO fields, including all of the population-specifc fields. Currently I am using the following command:

bcftools merge -m all -i AC:sum,AN:sum,AC_AFR:sum,AC_AMR:sum,AC_ASJ:sum,AC_EAS:sum,AC_FIN:sum,AC_NFE:sum,AC_OTH:sum,AC_SAS:sum,AN_AFR:sum,AN_AMR:sum,AN_ASJ:sum,AN_EAS:sum,AN_FIN:sum,AN_NFE:sum,AN_OTH:sum,AN_SAS:sum,GC_AFR:sum,GC_AMR:sum,GC_ASJ:sum,GC_EAS:sum,GC_FIN:sum,GC_NFE:sum,GC_OTH:sum,GC_SAS:sum,AC_raw:sum,AN_raw:sum,GC_raw:sum,GC:sum in1.vcf.bgz in2.vcf.bgz -o out.vcf

I can confirm that both input VCFs have the "AC" and "AN" fields, but after the merge, "AC" and "AN" are missing. All of the population-specific fields, eg. AC_AFR, are present. "GC" is also present. Why is the merge dropping "AC" and "AN"?

I am using: bcftools 1.9 Using htslib 1.9

The complete script:

library(Rsamtools)
rd <- file.path("inst", "raw")
dd <- file.path("data")
if (!dir.exists(dirname(rd))) dir.create(dirname(rd))
if (!dir.exists(rd)) dir.create(rd)
if (!dir.exists(dd)) dir.create(dd)

l <- "gs://gnomad-public/release/2.0.2/vcf/"
e1 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz")    
e2 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi")
g1 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz")
g2 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz.tbi")
getFile <- function(f, l) {
  if (file.exists(f)) return(TRUE)
  sub <- if (grepl("genomes", f)) "genomes" else "exomes"
  s <- sprintf("gsutil -m cp %s%s/%s %s", l, sub, basename(f), f)
  system(s)
  TRUE
}
getFile(e1, l)
getFile(e2, l)
getFile(g1, l)
getFile(g2, l)

# tpo GRCh37 coordinates (https://www.ncbi.nlm.nih.gov/gene/7173)
tpo <- "2:1417233-1547445"

## Subset down to allele and genotype count fields
infoFlds <- rownames(scanBcfHeader(e1)[[1]]$Header$INFO)
keep <- infoFlds[grep("^AC|^AN|^GC", infoFlds)]
keep <- keep[!grepl("POPMAX|Male|Female", keep)]
keepStr <- paste(paste0("INFO/", c(keep, "AS_FilterStatus")), collapse = ",")

e3 <- sub("r2.0.2.sites", "tpo", e1)
g3 <- sub("r2.0.2.sites.chr2", "tpo", g1)
fmt1 <- "bcftools annotate -r %s -x ^%s %s | bgzip -c > %s"
system(sprintf(fmt1, tpo, keepStr, e1, e3))
system(sprintf(fmt1, tpo, keepStr, g1, g3))
system(sprintf("bcftools index -t %s", e3))
system(sprintf("bcftools index -t %s", g3))

mrg <- paste(paste0(keep, ":sum"), collapse = ",")
c1 <- file.path(rd, "gnomad.tpo.vcf")
system(sprintf("bcftools merge -m all -i %s %s %s -o %s", mrg, e3, g3, c1))

The following loads the VCFs, and displays the INFO fields contained in the files. The genome data does not contain any South Asian samples.

eDat <- scanBcf(e3)
gDat <- scanBcf(g3)
cDat <- scanBcf(c1)

getInfoFlds <- function(dat) {
  sapply(strsplit(strsplit(dat$INFO[[1]], ";")[[1]], "="), "[[", 1)
}

getInfoFlds(eDat)
#  [1] "AC"              "AN"              "AC_AFR"          "AC_AMR"         
#  [5] "AC_ASJ"          "AC_EAS"          "AC_FIN"          "AC_NFE"         
#  [9] "AC_OTH"          "AC_SAS"          "AN_AFR"          "AN_AMR"         
# [13] "AN_ASJ"          "AN_EAS"          "AN_FIN"          "AN_NFE"         
# [17] "AN_OTH"          "AN_SAS"          "GC_AFR"          "GC_AMR"         
# [21] "GC_ASJ"          "GC_EAS"          "GC_FIN"          "GC_NFE"         
# [25] "GC_OTH"          "GC_SAS"          "AC_raw"          "AN_raw"         
# [29] "GC_raw"          "GC"              "AS_FilterStatus"
getInfoFlds(gDat)
#  [1] "AC"              "AN"              "GC_raw"          "AC_raw"         
#  [5] "AN_raw"          "GC"              "AC_AFR"          "AC_AMR"         
#  [9] "AC_ASJ"          "AC_EAS"          "AC_FIN"          "AC_NFE"         
# [13] "AC_OTH"          "AN_AFR"          "AN_AMR"          "AN_ASJ"         
# [17] "AN_EAS"          "AN_FIN"          "AN_NFE"          "AN_OTH"         
# [21] "AS_FilterStatus" "GC_AFR"          "GC_AMR"          "GC_ASJ"         
# [25] "GC_EAS"          "GC_FIN"          "GC_NFE"          "GC_OTH"         
getInfoFlds(cDat)
#  [1] "AC_AFR"          "AC_AMR"          "AC_ASJ"          "AC_EAS"         
#  [5] "AC_FIN"          "AC_NFE"          "AC_OTH"          "AC_raw"         
#  [9] "AN_AFR"          "AN_AMR"          "AN_ASJ"          "AN_EAS"         
# [13] "AN_FIN"          "AN_NFE"          "AN_OTH"          "AN_raw"         
# [17] "GC"              "GC_AFR"          "GC_AMR"          "GC_ASJ"         
# [21] "GC_EAS"          "GC_FIN"          "GC_NFE"          "GC_OTH"         
# [25] "GC_raw"          "AS_FilterStatus"

We can also show that each entry in the input VCFs have the AC field:

system(sprintf("bcftools view %s | grep AC= | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep -v '#' | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep AC= | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep -v '#' | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep AC= | wc -l", c1))
# 0

The South Asian (*_SAS) fields are also being lost, which is problematic:

setdiff(getInfoFlds(eDat), getInfoFlds(gDat))
# [1] "AC_SAS" "AN_SAS" "GC_SAS"
setdiff(getInfoFlds(eDat), getInfoFlds(cDat))
# [1] "AC"     "AN"     "AC_SAS" "AN_SAS" "GC_SAS"

It makes more sense they are lost, because they are not present in the genome set. However, I still need to pull that information along.

gnomad bcftools merge vcf • 901 views
ADD COMMENTlink modified 23 months ago • written 23 months ago by dayne.filer10

What happens if you omit the AC:sum,AN:sum part? Try it on the first 10 VCF entries only, so you can eyeball the results' veracity.

ADD REPLYlink written 23 months ago by RamRS27k

They are still omitted if I remove AC:sum,AN:sum. From the documentation it seems like only DP and DP4 have default behavior. It is odd that they are completely dropped though, rather than taking the value from the first file.

ADD REPLYlink modified 23 months ago • written 23 months ago by dayne.filer10

Can you edit your question and add the version of bcftools and htslib you're using please? Just paste the output of bcftools -v

ADD REPLYlink written 23 months ago by RamRS27k

Done. Happy to upload the entire script, but it requires downloading the (somewhat) large gnomAD VCFs.

ADD REPLYlink written 23 months ago by dayne.filer10

No, don't worry about it. I've worked a LOT on bcftools + gnomAD VCF files, so I am familiar with the setting.

ADD REPLYlink written 23 months ago by RamRS27k

You mention gnomAD genome and exome datasets, so I wonder if these are different samples (appropriate for bcftools merge) or the same set of samples (need a different tool).

ADD REPLYlink written 23 months ago by RamRS27k

Each file, in my understanding, has one "sample" -- which I am trying to merge together to get allele frequencies combining the genome and exome data (similar to what gnomAD displays for the browser). For example, here you can toggle including genome and exome data, which just sums over the allele count and allele number.

ADD REPLYlink modified 23 months ago • written 23 months ago by dayne.filer10

And the files e3 and g3 have the AC and AN fields? Just so we're sure merge is the one removing them.

ADD REPLYlink written 23 months ago by RamRS27k

Yes. I added some clarifying output to the post.

ADD REPLYlink written 23 months ago by dayne.filer10

I'm kinda busy so I'm unable to pay this as much attention as I'd like to. Wild suggestion: Try running bcftools from the command line, and quote all input arguments, especially the one passed to -i.

ADD REPLYlink written 23 months ago by RamRS27k

No luck. I appreciate your attempts to help. I also tried manually adding the SAS fields to the genome VCF before the merge, with no success. I thought it might exclude the "overall" values when there were subset values missing.

ADD REPLYlink written 23 months ago by dayne.filer10

Maybe give CombineVariants a shot?

ADD REPLYlink written 23 months ago by RamRS27k

Removing the _SAS fields did not fix the problem. I thought maybe the AN and AC were removed because the SAS fields didn't carry over from the exome file.

ADD REPLYlink written 23 months ago by dayne.filer10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1474 users visited in the last hour