Is there a way to calculate breadth of coverage from a VCF?
2.1 years ago
beausoleilmo ▴ 600

I have a VCF (and a Beagle file) containing genotype likelihoods. I was wondering how to calculate the breadth of coverage ("the percentage of target bases that have been sequenced for a given number of times", i.e. how much of the reference genome was sequenced in each individual) for all individuals across my reference genome.

See (https://www.nature.com/articles/nrg3642)
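
In other words (with d the minimum depth; below I effectively use d = 1, since the VCF only tells me whether a site was genotyped at all):

breadth of coverage = 100 * (number of target bases sequenced at least d times) / (total number of target bases)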

This line will give the reference genome contig sizes:

zcat ref.genome.fna.gz  | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' > chr_size.txt

Reading this data in R and preparing it for analysis.

# odd lines are the FASTA headers, even lines the corresponding sequence lengths
chr.size = readLines("chr_size.txt")
seq.od = seq(1, length(chr.size), by = 2)
seq.ev = seq(2, length(chr.size), by = 2)
chr.size.df = data.frame(
  chr  = sub(" .*", "", sub("^>", "", chr.size[seq.od])), # contig name only, so it matches the VCF CHROM column
  size = as.numeric(chr.size[seq.ev]))                    # lengths as numbers, not character strings

So this gives me a table with all the contig lengths. Next, I extract from the VCF whether each individual has a genotype likelihood at each position:

vcftools --gzvcf samples.vcf.gz --extract-FORMAT-info GL --out samples

Note: you could filter using the --chr flag to make the file smaller for testing purposes.

This will give me a file, samples.GL.FORMAT, containing only the genotype likelihoods.

Reading that file into R and recoding it so that I get a 1 if there is a genotype likelihood at a position and a 0 if it is ".":

gl.format = read.table("samples.GL.FORMAT.gz", header = TRUE, comment.char = "") # gzip the GL.FORMAT file first, or drop the .gz
bd.gl = gl.format[, -c(1:2)]                                   # drop CHROM and POS, keep one column per individual
bd.gl.01 = apply(bd.gl, 2, function(x) ifelse(x == ".", 0, 1)) # 1 = genotype likelihood present, 0 = missing
bd.gl.01.chr.pos = cbind(gl.format[, c(1:2)], bd.gl.01)        # add CHROM and POS back

From there, there should be a way to get breadth of coverage.
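
In fact, per-individual breadth for a single chromosome can already be read off this 0/1 matrix: the column sums over one chromosome, divided by that chromosome's length, are the per-individual breadths. A quick sketch (the chromosome name "chr1" is just a placeholder):

# per-individual breadth for one chromosome, straight from the 0/1 matrix
sub.chr = bd.gl.01.chr.pos$CHROM == "chr1"
breadth.quick = colSums(bd.gl.01[sub.chr, , drop = FALSE]) /
  chr.size.df$size[chr.size.df$chr == "chr1"]
head(sort(breadth.quick))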

I reshaped the data so it's easier to work with: instead of having the individuals in columns, I placed them in one ID column, with a presence/absence (pa) column indicating whether a genotype likelihood was present.

library(tidyr) # pivot_longer
library(dplyr) # %>%, group_by, summarise (used below)

bd.gl.long = bd.gl.01.chr.pos %>% 
  pivot_longer(cols = -c("CHROM","POS"),
               names_to = "ID",
               values_to = "pa")

Keeping only the sites that are present:

bd.gl.long.pa = bd.gl.long[bd.gl.long$pa == 1,]
nrow(bd.gl.long.pa)
bd.gl.long.pa$CHROM = as.factor(bd.gl.long.pa$CHROM)

Since some chromosomes or contigs are very large, I subset to a single chromosome here.

chr.subset = "chr1"
bd.chrsub = bd.gl.long.pa[bd.gl.long.pa$CHROM == chr.subset,]

Merging the two datasets; the chromosome name is the only key the per-position table and the chromosome-size table share, so each position row gets its chromosome's total size attached.

all.pos = merge(bd.chrsub, chr.size.df, by.x = "CHROM", by.y = "chr", all.x = TRUE)

Plotting the data:

plot(0, type = "n", ylim = c(0,1), main = paste("Breadth of coverage for Chr",chr.subset),
     xlim = c(0, chr.size.df[chr.size.df$chr == chr.subset,"size"]))
text(0,.55,paste("Chromosome", chr.subset), pos = 4)
segments(x0 = 0, # showing the FULL chromosome length 
         y0 = .52,
         x1 = chr.size.df[chr.size.df$chr == chr.subset,"size"], 
         y1 = .52, 
         lend = 1, lwd = 5)
id.names = unique(all.pos$ID)
pos.y = 1:length(id.names)/length(id.names)/2
for (k in 1:length(id.names)) { # add the points for all the individuals one by one. 
  points(x = all.pos[all.pos$ID == id.names[k],"POS"], 
         y = rep(pos.y[k],nrow(all.pos[all.pos$ID == id.names[k],])), 
         pch = ".",cex = 1)  
}

Making tick marks to better situate the individuals:

seq.samples = pos.y[seq(10, length(pos.y), by = 10)] # a tick mark every 10th individual
points(x = rep(0, length(seq.samples)), 
       y = seq.samples, 
       pch = "-", cex = 1)

This gives a visual representation of the breadth of coverage. These numbers can then be used to work out how much of each contig was recovered in each individual.

chr.size.sub = chr.size.df[chr.size.df$chr == chr.subset, "size"]
breadth.cov = all.pos %>% 
  group_by(ID) %>% 
  summarise(sum.pa = sum(pa),                          # number of sites with a genotype likelihood for this individual
            percentage.breadth = sum.pa/chr.size.sub)  # proportion (0-1) of the chromosome covered
hist(breadth.cov$percentage.breadth)
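
To get that per-contig breakdown for every contig at once, rather than one chromosome at a time, something like this sketch (using the bd.gl.long and chr.size.df objects created above) should work:

# breadth per contig per individual, across the whole long-format table
breadth.all = bd.gl.long %>% 
  group_by(CHROM, ID) %>% 
  summarise(sum.pa = sum(pa), .groups = "drop") %>% 
  left_join(chr.size.df, by = c("CHROM" = "chr")) %>% 
  mutate(breadth = sum.pa / size)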

I think this is working, but I would like a second opinion or a cleaner/easier solution.

VCF Coverage Beagle
2.1 years ago
beausoleilmo ▴ 600

The example I showed in the question works for small VCFs. When doing low-coverage WGS, the VCF is so huge that this approach becomes impractical. It's better (at least from what I know) to do it for each sample individually and then print the summary as you want.

[...steps to produce a BAM file]

# index Bam file (if already done, skip this)
samtools index specimen_ID.bam

# Use mpileup to get the number of sites sequenced 
samtools mpileup \
  --fasta-ref my.ref.genome.fna \
  specimen_ID.bam \
  --output specimen_ID.pileup

# in the command above you could subset the mpileup output with  `--region CHROMOSOME_NAME`. 

# Group by chromosome names and count the number of sites sequenced 
awk '{print $1}' specimen_ID.pileup | uniq -c > seq.l.per.chr.spID.txt

This gives you, for each chromosome, the number of sites covered by at least one read, i.e. the breadth of sequencing per chromosome.

# Get the reference genome size per chromosome or contigs
zcat my.ref.genome.fna.gz  | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' > chr_size.txt

There is surely a way to take the per-chromosome counts in seq.l.per.chr.spID.txt and divide them by the corresponding chromosome lengths in chr_size.txt, but I haven't implemented that yet...
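
Something like this (untested) R sketch should do that division, assuming the two files produced above:

# combine per-chromosome covered-site counts with chromosome lengths
covered = read.table("seq.l.per.chr.spID.txt", col.names = c("covered.sites", "chr")) # uniq -c output: count, then chromosome name
chr.size = readLines("chr_size.txt")
sizes = data.frame(chr  = sub(" .*", "", sub("^>", "", chr.size[seq(1, length(chr.size), by = 2)])),
                   size = as.numeric(chr.size[seq(2, length(chr.size), by = 2)]))
breadth = merge(covered, sizes, by = "chr")
breadth$prop.covered = breadth$covered.sites / breadth$size # proportion of each chromosome covered by >= 1 read
breadth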

The example in the question was nice since it gave a visual output of the breadth of coverage, but it might be very hard to do with very large genomes and sequencing that covers the whole genome.

2.1 years ago
erdiazval ▴ 110

I think this is a rather conceptual problem. If you want genome-wide coverage, a VCF may not be the most adequate input for estimating this metric; I agree a SAM/BAM file would be a better option. That said, if polymorphisms are abundant in your VCF, say a SNP/INDEL every 100 bp, then you could simply write a function to perform a sliding-window analysis estimating the median/mean number of reads mapping to, say, 1000 bp windows in 100 bp steps. If you think this approach is adequate, I could share a pair of functions I've written in R to parse VCF files and to perform the sliding-window analysis. It really is not a complicated task.
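
For illustration only, here is a rough sketch of what such a sliding-window pass could look like (this is not the pair of functions mentioned above; it assumes the VCF has already been parsed into a data frame for one chromosome with a POS column and a per-site read-depth column, here called DP):

# mean depth in 1000 bp windows moved in 100 bp steps (illustrative sketch)
sliding_depth = function(vcf.df, window = 1000, step = 100) {
  starts = seq(1, max(vcf.df$POS), by = step)
  data.frame(start = starts,
             end   = starts + window - 1,
             mean.depth = sapply(starts, function(s) {
               in.win = vcf.df$POS >= s & vcf.df$POS < s + window
               if (any(in.win)) mean(vcf.df$DP[in.win]) else NA # NA where no variant falls in the window
             }))
}

# win.cov = sliding_depth(vcf.df)
# plot(win.cov$start, win.cov$mean.depth, type = "l")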


That'd be cool! The only thing is that my VCF is >210 GB, so would R be able to handle that?

2.1 years ago

I think you'll need a BAM for this, then use something like Sambamba depth, or the tool Qualimap.

Further details

How to plot coverage and depth statistics of a bam file


I was thinking more of using the VCF directly, since it contains the information on the sites for all individuals (missing or not). So a table could be made with sites in rows and IDs in columns stating whether each site is present or not. Right? But how to do that?
