How to compute physical per-base coverage from WGS paired-end reads?
4
1
Entering edit mode
6.9 years ago
Christian ★ 3.0k

Physical (or fragment) coverage means counting also the unsequenced part of the fragment between properly mapped read pairs. As output I'd like to get a frequency table that tells me how many bases are covered how many times. I could not find an easy solution for this problem using standard tools only (SAMtools, BEDtools, GATK).

EDIT: The reason why I am interested in physical coverage is because I think this is quite a useful metric when it comes to assess the power of a WGS data set to detect structural variations.

coverage WGS • 6.3k views
7
Entering edit mode
6.9 years ago

You can do this in three steps using bedtools. First convert your paired-end BAM file to a BEDPE formated file of fragments with bedtools bamToBed, then use cut to convert BEDPE to BED, then use bedtools genomecov to compute coverage from the resulting BED file of fragments. See this the section after NOTES in this post at the bedtools mailing list for more information: https://groups.google.com/forum/#!msg/bedtools-discuss/kulRA_7Ybow/4Be6otCybZgJ

2
Entering edit mode

I ended up following this advice, which worked quite well. I was even able to squeeze this procedure into a single command as follows:

    samtools sort -no <(samtools view -bh -f 2 -F 1792 -q 1 mybamfile.bam 20) tmp \
| bamToBed -i stdin -bedpe \
| cut -f 1,2,6 \
| sort -k 1,1 \
| bedtools genomecov \
-i stdin \
-g <(echo -e "20\t59505520\n") \
| grep ^20 \
> coverage.txt

This command first sorts the input BAM file by read names, converts properly mapped, unique, and non-duplicate read pairs to BED intervals using bamToBed, cut the relevant three columns (chromosome, start coordinate, end coordinate), sort the BED intervals by chromosome name, feed this into 'bedtools genomecov' and grep the results. Note that this solution operates only on one chromosome (chr 20 in this case), as this reduces required computation time considerably and the results should be similar to genome-wide values. As chromosome size I specified the assembly size of chromosome 20 without gaps ('N's), as one cannot have coverage in unassembled regions. The output of 'bedtools genomecov' can then be visualized to produce something like shown below.

0
Entering edit mode

Thumbs up for posting the final solution!

0
Entering edit mode

Very nice graphs Christian :)

One small note, your samtools filter flag "-F 1792" needs to be "-F 1804" to remove unmapped reads as -f 2 isn't enough, (particularly for anyone aligning with BWA).

0
Entering edit mode

Hi,

Nice work putting this together. However, I noticed a small bug when using this with my data : bamToBed -i stdin -bedpe | cut -f 1,2,6 is wrong because the highest coordinate of the read pair is not always the 6th field of the .bedpe file (even if the smallest is always the 2d field). In my case, it concerns about 0.2% of the read pairs. I guess this percentage depends of the average fragment length in your libray and whether one does adaptor trimming (I did).

bamToBed -i n_sorted.bam -bedpe | awk ' $6 <$3 {print $0}' | head -n 2 I 3346633 3346682 I 3346634 3346680 DB9GZKS1:560:H3YL3BCXY:1:1101:1207:30056 50 + - III 8154 8203 III 8156 8202 DB9GZKS1:560:H3YL3BCXY:1:1101:1224:17273 3 + -  ADD REPLY 0 Entering edit mode Try Pileup, then; it should always produce correct results. pileup.sh in=mapped.bam covstats=covstats.txt covhist=covhist.txt physicalcoverage  ADD REPLY 0 Entering edit mode Thx for your input. In the end I just wrote a little script to parse the bedpe file more robustly than cut -f 1,2,6. It returns the chromosome (field 1), min coordinate and max coordinate of the two reads. ADD REPLY 1 Entering edit mode 6.9 years ago You can look at Exercise 2 on my tutorial to see how you can get different coverages out i.e. coverages within the genome, and average coverages. I generate a frequency table at the end http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html Best Wishes, Umer ADD COMMENT 0 Entering edit mode I am specifically interested in the physical (aka fragment) coverage, not read coverage. ADD REPLY 1 Entering edit mode 6.9 years ago Fidel ★ 2.0k [update] After looking at the accepted solution I realised that I my answer will not solve the problem, it will only produce a coverage file per bp with extended reads. I recently posted in this on biostars ( Normalising ChIP-seq data to RPM using a 1 or 2 count for paired-end data? ) which is related to your question. In deepTools, there is a command called bamCoverage that can do exactly what you want. Read pairs are extended and the coverage is computed based on the fragment length. The output is either a bedGraph file or a bigWig file. ADD COMMENT 0 Entering edit mode Interesting. Can it output the per-base raw coverage without any normalization? Also, for paired-end reads, is the fragment overlap counted once or twice (I got a bit confused on this one reading the discussion you linked)? ADD REPLY 0 Entering edit mode Yes, base pair resolution without normalisation is possible: $ bamCoverage -b file.bam --binSize 1 -o file.bw

To count the overlap only once add --samFlag 64 (such that only first mates are considered).  By default, both read mates are extended.

0
Entering edit mode
6.9 years ago

I don't know of a tool that does this, specifically.  However, you could accomplish it in a slightly different way:

Merge all of the overlapping reads, generating a file of merged reads, and another file (or pair of files) of the unmerged reads.

Fuse together all of the remaining unmerged pairs: this means reverse-complement read 2, and then concatenate the sequences.  e.g. if read 1 is AAAA and read 2 is CCAG, the fused read would be AAAACTGG.  I don't have a tool for that, sorry!

Concatenated the merged reads with the fused reads, then map all of them with BBMap, which can handle the resulting potentially long indels at the junction where the non-overlapping pairs were fused together.  Using the "covhist" flag, it will directly output the frequency of base coverage.  This will consider bases under by deletion events to be covered, so it will calculate the physical coverage including the unsequenced portion:

0
Entering edit mode
Interesting solution, thanks for sharing. However, this requires re-mapping my large WGS BAM files, which I would like to avoid at all costs. Along this line, I actually thought of creating a 'fragment' BAM out of my aligned paired-end BAM by stitching read pairs together and filling in N's for the insert. I could then use available tools to compute coverage on the new (artificially single-end) BAM. However, I fear this stitching gets complicated quickly.
1
Entering edit mode

FYI, rather than that overly-complicated method, I have now updated BBMap so that it can internally calculate physical coverage.  Just add the "physicalcoverage" flag when running it.  This can also be done from already-mapped sam/bam files like this:

pileup.sh in=mapped.bam covstats=covstats.txt covhist=covhist.txt physicalcoverage