Question

Is there a way to calculate breadth of coverage from a VCF?

0

Entering edit mode

2.8 years ago

beausoleilmo ▴ 610

I have a VCF (and Beagle file) which contains Genotype likelihoods. I was wondering how to calculate breadth of coverage ("the percentage of target bases that have been sequenced for a given number of times" or how much of the reference genome was sequenced in each individual) for all the individuals across my reference genome.

See (https://www.nature.com/articles/nrg3642)

This line will give the reference genome contig size:

zcat ref.genome.fna.gz  | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' > chr_size.txt

Reading this data in R and preparing it for analysis.

chr.size = readLines("chr_size.txt")
seq.od = seq(1, length(chr.size), by =2)
seq.ev = seq(2, length(chr.size), by =2)
chr.size.df = data.frame(chr = chr.size[seq.od], size = chr.size[seq.ev])
chr.size.df$chr.pos = paste(chr.size.df$chr, chr.size.df$length, sep ="_") # for merging later

So this would give me a file with all the contig length for which I could find if the VCF file contains a position for an individual:

vcftools --gzvcf samples.vcf.gz --extract-FORMAT-info GL --out samples

Note: you could filter using the "--chr" flag and make the file smaller for testing purpose.

This will give me a file with samples.GL.FORMAT with only the genotype likelihoods.

Reading that file in R and processing it so that I get a 1 if there is a sequence at a position and 0 if it is "."

gl.format = read.table("samples.GL.FORMAT.gz",header=T,comment="")
bd.gl = gl.format[,-c(1:2)]
bd.gl.01 = apply(bd.gl, 2, function(x) ifelse(x== ".",0,1))
bd.gl.01.chr.pos = cbind(gl.format[,c(1:2)],bd.gl.01)

Then from there, there might be a way to get breadth of coverage.

I reshaped the data so it's easier to work with (instead of having individuals in columns, I placed them in 1 column knowing if a genotype likelihood was present or absent (pa))

bd.gl.long = bd.gl.01.chr.pos %>% 
  pivot_longer(cols = -c("CHROM","POS"),
               names_to = "ID",
               values_to = "pa")

Just taking the sites that are present

bd.gl.long.pa = bd.gl.long[bd.gl.long$pa == 1,]
nrow(bd.gl.long.pa)
bd.gl.long.pa$CHROM = as.factor(bd.gl.long.pa$CHROM)

Since some chromosomes or contigs are too large, I subset the chromosome here.

chr.subset = "chr1"
bd.chrsub = bd.gl.long.pa[bd.gl.long.pa$CHROM == chr.subset,]

Making a variable to merge the 2 datasets.

bd.chrsub$chr.pos = paste(bd.chrsub$CHROM, bd.chrsub$POS, sep ="_")
all.pos = merge(bd.chrsub, chr.size.df, by = "chr.pos", all.x = TRUE)

Plotting the data:

plot(0, type = "n", ylim = c(0,1), main = paste("Breadth of coverage for Chr",chr.subset),
     xlim = c(0, chr.size.df[chr.size.df$chr == chr.subset,"size"]))
text(0,.55,paste("Chromosome", chr.subset), pos = 4)
segments(x0 = 0, # showing the FULL chromosome length 
         y0 = .52,
         x1 = chr.size.df[chr.size.df$chr == chr.subset,"size"], 
         y1 = .52, 
         lend = 1, lwd = 5)
id.names = unique(all.pos$ID)
pos.y = 1:length(id.names)/length(id.names)/2
for (k in 1:length(id.names)) { # add the points for all the individuals one by one. 
  points(x = all.pos[all.pos$ID == id.names[k],"POS"], 
         y = rep(pos.y[k],nrow(all.pos[all.pos$ID == id.names[k],])), 
         pch = ".",cex = 1)  
}

Making tick marks to better situate the individuals:

seq.samples = na.omit(pos.y[seq(0,300, by = 10)])
points(x = rep(0, length(seq.samples)), 
       y = seq.samples, 
       pch = "-",cex = 1)

This gives a visual representation of the breadth of coverage. Now these numbers could be used to get details about how much of each contigs was found in each individuals.

chr.size.sub = chr.size.df[chr.size.df$chr == chr.subset,"size"]
breadth.cov = all.pos %>% 
  group_by(ID) %>% 
  summarise(sum.pa = sum(pa), 
            percentage.breadth = sum.pa/chr.size.sub)
hist(breadth.cov$percentage.breadth)

I think this is working, but would like to have a second thought or a cleaner/easier solution.

VCF Coverage Beagle • 3.1k views

ADD COMMENT • link 2.8 years ago by beausoleilmo ▴ 610

1

Entering edit mode

2.8 years ago

erdiazval ▴ 110

I think this is a rather conceptual problem. If you want to get the genome-wide coverage, maybe a VCF wouldn't be the most adequate input to estimate this metric, I agree a SAM/BAM file would be a better option. That said, if the polymorphisms are abundant in your VCF, lets say a SNP/INDE every 100 bp, then you could simply write a function to perform a sliding windows analysis to estimate the median/mean number of reads mapping to lets say 1000bp windows in 100pb steps. If you think this approach is adequate, I could share a pair of functions I've written in R to parse VCF files and to perform the sliding windows analysis. It really is not a complicated task.

ADD COMMENT • link 2.8 years ago by erdiazval ▴ 110

0

Entering edit mode

That'd be cool! The only thing is that my vcf is >210GB and so would R be able to run that?

ADD REPLY • link 2.8 years ago by beausoleilmo ▴ 610

0

Entering edit mode

2.8 years ago

colindaven 7.7k

I think you'll need a BAM for this, then use something like Sambamba depth, or the tool Qualimap.

Further details

How to plot coverage and depth statistics of a bam file

ADD COMMENT • link 2.8 years ago by colindaven 7.7k

0

Entering edit mode

I was more thinking using the VCF directly as it contains the information of the sites for all individuals (missing or not). So a table could be made with sites in rows and ID in columns stating if it is present or not. Right? But how to do that?

ADD REPLY • link 2.8 years ago by beausoleilmo ▴ 610

score 1 · Accepted Answer · 2022-10-04

The example I showed in the question is for small VCFs. When doing low-coverage WGS, it seems that the VCF is so huge that it is very unpractical. It's better (at least from what I know) to do it for each sample individually and then print the summary as you want.

[...steps to produce a BAM file]

# index Bam file (if already done, skip this)
samtools index specimen_ID.bam

# Use mpileup to get the number of sites sequenced 
samtools mpileup \
  --fasta-ref my.ref.genome.fna \
  specimen_ID.bam \
  --output specimen_ID.pileup

# in the command above you could subset the mpileup output with  `--region CHROMOSOME_NAME`. 

# Group by chromosome names and count the number of sites sequenced 
awk '{print $1}' specimen_ID.pileup | uniq -c > seq.l.per.chr.spID.txt

This gives you the breadth of sequencing for each chromosome

# Get the reference genome size per chromosome or contigs
zcat my.ref.genome.fna.gz  | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' > chr_size.txt

There is surely a way to divide the output from seq.l.per.chr.spID.txt and dividing by the chromosome length for the reference genome chr_size.txt. But I haven't implemented that yet...

The example in the question was nice since it was giving a visual output of the breadth of coverage. But it might be very difficult to do with very large genomes and sequencing that covers the whole genome.