Reading AD and other genotype information from the VCF file
1
0
Entering edit mode
10 weeks ago

I need to get the allele depth and read depth from a vcf file, sometimes it can look like this:

GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:25,13:0.348:38:7,2:7,5:14,7:12,13,7,6

But sometimes they look like this:

DP:FDP:SDP:SUBDP:AU:CU:GU:TU 35:1:0:0:0,0:0,0:34,35:0,0 35:0:0:0:0,0:2,2:31,31:2,2

And ocassionally there are multiple alleles for the same position.

Are there R tools that help with extracting this information? I am using readVcf from the Bioconductor VariantAnnotation package, but I still need to do extra steps to compare both cases above, and I think there migth be more situations depending on the variant caller that creates the vcf file. I can do the work but I was wondering if anyone uses another package to extract allele depth for each variant from a vcf file.

Thank you in advance.

R vcf • 5.9k views
ADD COMMENT
2
Entering edit mode

bcftools query -f '[%CHROM %POS %REF %ALT %SAMPLE %GT %AD\n]'

ADD REPLY
0
Entering edit mode
1 day ago
Kevin Blighe ★ 90k

You can extract allele depth (AD) and read depth (DP) using the VariantAnnotation package that you mentioned. The readVcf function imports the VCF as a VCF object, and you can access these fields directly via the geno slot if they are present in the FORMAT section. For standard formats like GT:AD:AF:DP:F1R2:F2R1:FAD:SB, AD provides comma-separated depths for the reference and alternate alleles, while DP gives the total read depth.

Here is example code to extract them:

library(VariantAnnotation)

# Read the VCF file
vcf <- readVcf("your_file.vcf", "hg38")

# Extract allele depths (AD) as a matrix or array
ad <- geno(vcf)$AD

# Extract read depths (DP) as a matrix
dp <- geno(vcf)$DP

# View dimensions and first few entries
dim(ad)
head(ad[, 1:2])  # First variants, first two samples
dim(dp)
head(dp[, 1:2])

For non-standard formats like DP:FDP:SDP:SUBDP:AU:CU:GU:TU, these represent base-specific unfiltered and filtered counts (e.g., AU for A, with comma-separated values possibly indicating strands or qualities). The VariantAnnotation package still imports them via geno, but you must map them to AD based on the reference (REF) and alternate (ALT) alleles for each variant. Extract the relevant fields and compute allele depths manually.

Here is code to handle this:

# Extract base-specific fields
au <- geno(vcf)$AU
cu <- geno(vcf)$CU
gu <- geno(vcf)$GU
tu <- geno(vcf)$TU

# For each variant, map REF/ALT to depths (example for first sample)
ref_alt_depths <- sapply(seq_len(nrow(vcf)), function(i) {
  ref_base <- as.character(rowRanges(vcf)$REF[i])
  alt_bases <- as.character(rowRanges(vcf)$ALT[i])
  ref_depth <- switch(ref_base, A=au[i,1], C=cu[i,1], G=gu[i,1], T=tu[i,1])
  alt_depths <- sapply(alt_bases, function(alt) switch(alt, A=au[i,1], C=cu[i,1], G=gu[i,1], T=tu[i,1]))
  c(ref_depth, alt_depths)
})

This approach accommodates multiple alleles. If your VCFs vary widely, consider the vcfR package for additional flexibility in parsing FORMAT fields via extract.gt(vcf, element="AD").

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3079 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6