Physical (or fragment) coverage means counting also the unsequenced part of the fragment between properly mapped read pairs. As output I'd like to get a frequency table that tells me how many bases are covered how many times. I could not find an easy solution for this problem using standard tools only (SAMtools, BEDtools, GATK).
EDIT: The reason why I am interested in physical coverage is because I think this is quite a useful metric when it comes to assess the power of a WGS data set to detect structural variations.
I ended up following this advice, which worked quite well. I was even able to squeeze this procedure into a single command as follows:
This command first sorts the input BAM file by read names, converts properly mapped, unique, and non-duplicate read pairs to BED intervals using bamToBed, cut the relevant three columns (chromosome, start coordinate, end coordinate), sort the BED intervals by chromosome name, feed this into 'bedtools genomecov' and grep the results. Note that this solution operates only on one chromosome (chr 20 in this case), as this reduces required computation time considerably and the results should be similar to genome-wide values. As chromosome size I specified the assembly size of chromosome 20 without gaps ('N's), as one cannot have coverage in unassembled regions. The output of 'bedtools genomecov' can then be visualized to produce something like shown below.
Thumbs up for posting the final solution!
Very nice graphs Christian :)
One small note, your samtools filter flag
-F 1792
needs to be-F 1804
to remove unmapped reads as-f 2
isn't enough, (particularly for anyone aligning with BWA).Hi,
Nice work putting this together. However, I noticed a small bug when using this with my data :
bamToBed -i stdin -bedpe | cut -f 1,2,6
is wrong because the highest coordinate of the read pair is not always the 6th field of the .bedpe file (even if the smallest is always the 2d field). In my case, it concerns about 0.2% of the read pairs. I guess this percentage depends of the average fragment length in your libray and whether one does adaptor trimming (I did).Try Pileup, then; it should always produce correct results.
Thx for your input. In the end I just wrote a little script to parse the bedpe file more robustly than
cut -f 1,2,6
. It returns the chromosome (field 1), min coordinate and max coordinate of the two reads.