Question: How to compute physical per-base coverage from WGS paired-end reads?
1
gravatar for Christian
4.6 years ago by
Christian2.8k
Cambridge, US
Christian2.8k wrote:

Physical (or fragment) coverage means counting also the unsequenced part of the fragment between properly mapped read pairs. As output I'd like to get a frequency table that tells me how many bases are covered how many times. I could not find an easy solution for this problem using standard tools only (SAMtools, BEDtools, GATK).

EDIT: The reason why I am interested in physical coverage is because I think this is quite a useful metric when it comes to assess the power of a WGS data set to detect structural variations.

coverage wgs • 5.1k views
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Christian2.8k
7
gravatar for Casey Bergman
4.6 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

You can do this in three steps using bedtools. First convert your paired-end BAM file to a BEDPE formated file of fragments with bedtools bamToBed, then use cut to convert BEDPE to BED, then use bedtools genomecov to compute coverage from the resulting BED file of fragments. See this the section after NOTES in this post at the bedtools mailing list for more information: https://groups.google.com/forum/#!msg/bedtools-discuss/kulRA_7Ybow/4Be6otCybZgJ

ADD COMMENTlink written 4.6 years ago by Casey Bergman18k
2

I ended up following this advice, which worked quite well. I was even able to squeeze this procedure into a single command as follows:

    samtools sort -no <(samtools view -bh -f 2 -F 1792 -q 1 mybamfile.bam 20) tmp \
        | bamToBed -i stdin -bedpe \
        | cut -f 1,2,6 \
        | sort -k 1,1 \
        | bedtools genomecov \
            -i stdin \
            -g <(echo -e "20\t59505520\n") \
        | grep ^20 \
        > coverage.txt

This command first sorts the input BAM file by read names, converts properly mapped, unique, and non-duplicate read pairs to BED intervals using bamToBed, cut the relevant three columns (chromosome, start coordinate, end coordinate), sort the BED intervals by chromosome name, feed this into 'bedtools genomecov' and grep the results. Note that this solution operates only on one chromosome (chr 20 in this case), as this reduces required computation time considerably and the results should be similar to genome-wide values. As chromosome size I specified the assembly size of chromosome 20 without gaps ('N's), as one cannot have coverage in unassembled regions. The output of 'bedtools genomecov' can then be visualized to produce something like shown below.Template coverage

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Christian2.8k

Thumbs up for posting the final solution!

ADD REPLYlink written 4.6 years ago by george.ry1.1k

Very nice graphs Christian :)

One small note, your samtools filter flag "-F 1792" needs to be "-F 1804" to remove unmapped reads as -f 2 isn't enough, (particularly for anyone aligning with BWA).

ADD REPLYlink written 4.4 years ago by John12k

Hi,

Nice work putting this together. However, I noticed a small bug when using this with my data : bamToBed -i stdin -bedpe | cut -f 1,2,6 is wrong because the highest coordinate of the read pair is not always the 6th field of the .bedpe file (even if the smallest is always the 2d field). In my case, it concerns about 0.2% of the read pairs. I guess this percentage depends of the average fragment length in your libray and whether one does adaptor trimming (I did).

bamToBed -i n_sorted.bam -bedpe | awk ' $6 < $3 {print $0}' | head -n 2

I   3346633 3346682 I   3346634 3346680 DB9GZKS1:560:H3YL3BCXY:1:1101:1207:30056    50  +   -
III 8154    8203    III 8156    8202    DB9GZKS1:560:H3YL3BCXY:1:1101:1224:17273    3   +   -
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Carlo Yague4.6k

Try Pileup, then; it should always produce correct results.

pileup.sh in=mapped.bam covstats=covstats.txt covhist=covhist.txt physicalcoverage
ADD REPLYlink written 2.8 years ago by Brian Bushnell16k

Thx for your input. In the end I just wrote a little script to parse the bedpe file more robustly than cut -f 1,2,6. It returns the chromosome (field 1), min coordinate and max coordinate of the two reads.

ADD REPLYlink written 2.8 years ago by Carlo Yague4.6k
1
gravatar for umer.zeeshan.ijaz
4.6 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.7k wrote:

You can look at Exercise 2 on my tutorial to see how you can get different coverages out i.e. coverages within the genome, and average coverages. I generate a frequency table at the end

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

Best Wishes,

Umer

ADD COMMENTlink written 4.6 years ago by umer.zeeshan.ijaz1.7k

I am specifically interested in the physical (aka fragment) coverage, not read coverage.

ADD REPLYlink written 4.6 years ago by Christian2.8k
1
gravatar for Fidel
4.6 years ago by
Fidel1.9k
Germany
Fidel1.9k wrote:

[update] After looking at the accepted solution I realised that I my answer will not solve the problem, it will only produce a coverage file per bp with extended reads.

I recently posted in this on biostars ( Normalising ChIP-seq data to RPM using a 1 or 2 count for paired-end data? ) which is related to your question. In deepTools, there is a command called bamCoverage that can do exactly what you want. Read pairs are extended and the coverage is computed based on the fragment length. The output is either a bedGraph file or a bigWig file.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Fidel1.9k

Interesting. Can it output the per-base raw coverage without any normalization? Also, for paired-end reads, is the fragment overlap counted once or twice (I got a bit confused on this one reading the discussion you linked)?

ADD REPLYlink written 4.6 years ago by Christian2.8k

Yes, base pair resolution without normalisation is possible:

$ bamCoverage -b file.bam --binSize 1 -o file.bw

To count the overlap only once add --samFlag 64 (such that only first mates are considered).  By default, both read mates are extended.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Fidel1.9k
0
gravatar for Brian Bushnell
4.6 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I don't know of a tool that does this, specifically.  However, you could accomplish it in a slightly different way:

Adapter-trim all of the reads.

Merge all of the overlapping reads, generating a file of merged reads, and another file (or pair of files) of the unmerged reads.

Fuse together all of the remaining unmerged pairs: this means reverse-complement read 2, and then concatenate the sequences.  e.g. if read 1 is AAAA and read 2 is CCAG, the fused read would be AAAACTGG.  I don't have a tool for that, sorry!

Concatenated the merged reads with the fused reads, then map all of them with BBMap, which can handle the resulting potentially long indels at the junction where the non-overlapping pairs were fused together.  Using the "covhist" flag, it will directly output the frequency of base coverage.  This will consider bases under by deletion events to be covered, so it will calculate the physical coverage including the unsequenced portion:

bbmap.sh in=reads.fq ref=reference.fa nodisk covhist=covhist.txt

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Brian Bushnell16k
Interesting solution, thanks for sharing. However, this requires re-mapping my large WGS BAM files, which I would like to avoid at all costs. Along this line, I actually thought of creating a 'fragment' BAM out of my aligned paired-end BAM by stitching read pairs together and filling in N's for the insert. I could then use available tools to compute coverage on the new (artificially single-end) BAM. However, I fear this stitching gets complicated quickly.
ADD REPLYlink written 4.6 years ago by Christian2.8k
1

FYI, rather than that overly-complicated method, I have now updated BBMap so that it can internally calculate physical coverage.  Just add the "physicalcoverage" flag when running it.  This can also be done from already-mapped sam/bam files like this:

pileup.sh in=mapped.bam covstats=covstats.txt covhist=covhist.txt physicalcoverage

ADD REPLYlink written 4.6 years ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1029 users visited in the last hour