Entering edit mode
3.9 years ago
ellieuk
▴
40
Hi,
I'd like to calculate the percentage of given genes covered by an aligned BAM at a given depth i.e. 10x.
Output expected:
Gene_name percentage_covered_at_10x
Gene1 95%
Gene2 80%
Gene3 99%
I tried: bedtools coverage -abam input.bam -b genes.bed
to just give me coverage at any depth which provides me with the fraction of non-zero depth mapped reads for each coding exon (last column). Output below:
chr1 26696403 26697540 ARID1A 0 + 431 1137 1137 1.0000000
chr1 26729650 26729863 ARID1A 1 + 212 213 213 1.0000000
chr1 26731151 26731604 ARID1A 2 + 528 453 453 1.0000000
chr1 26732675 26732792 ARID1A 3 + 70 117 117 1.0000000
chr1 26767789 26767999 ARID1A 10 + 128 210 210 1.0000000
chr1 26771118 26771326 ARID1A 11 + 117 208 208 1.0000000
chr1 26760855 26761096 ARID1A 4 + 152 241 241 1.0000000
chr1 26761383 26761473 ARID1A 5 + 56 90 90 1.0000000
chr1 26762151 26762319 ARID1A 6 + 96 168 168 1.0000000
chr1 26762972 26763285 ARID1A 7 + 253 313 313 1.0000000
chr1 26766220 26766366 ARID1A 8 + 131 146 146 1.0000000
chr1 26766456 26766566 ARID1A 9 + 194 110 110 1.0000000
chr1 26772499 26772632 ARID1A 12 + 187 133 133 1.0000000
chr1 26772811 26772987 ARID1A 13 + 98 176 176 1.0000000
chr1 26773345 26773496 ARID1A 14 + 105 151 151 1.0000000
chr1 26773579 26773717 ARID1A 15 + 242 138 138 1.0000000
chr1 26773801 26773898 ARID1A 16 + 116 97 97 1.0000000
chr1 26774328 26775220 ARID1A 17 + 817 892 892 1.0000000
chr1 26775576 26775707 ARID1A 18 + 92 131 131 1.0000000
chr1 26779022 26780756 ARID1A 19 + 1394 1734 1734 1.0000000
chr3 185204415 185204757 EHHADH 1 - 206 342 342 1.0000000
chr3 185192225 185193487 EHHADH 0 - 1192 1262 1262 1.0000000
chr3 185218135 185218240 EHHADH 2 - 154 105 105 1.0000000
chr3 185229431 185229543 EHHADH 3 - 32 112 112 1.0000000
chr3 185235289 185235462 EHHADH 4 - 92 173 173 1.0000000
chr3 185248413 185248517 EHHADH 5 - 65 104 104 1.0000000
chr3 185253948 185254022 EHHADH 6 - 76 74 74 1.0000000
chr7 66629064 66629208 KCTD7 0 + 85 144 144 1.0000000
chr7 66633274 66633444 KCTD7 1 + 97 170 170 1.0000000
chr7 66638252 66638431 KCTD7 2 + 199 179 179 1.0000000
chr7 66638855 66639232 KCTD7 3 + 268 377 377 1.0000000
However, I want to be a bit smarter and factor in the following:
- I'd like the output to be per (coding regions) of the gene as a whole (NB: my input .bed file already has coding exon start and stop positions for each gene as shown by multiple lines)
- Only include coverage when depth >10 or any other desired number
If anyone has any ideas, I'd be very grateful! Please bear in mind I'm a clinician not an informatician! E
This can be done by piping the output of
bedtools
into another tool, e.g.awk
:$NF
means "last column" (theNF
th column ($
), whereNF
= "number of fields" or "total number of 'columns' in your file")$0
means = print all fieldsThat being said, I'm not sure if the last column is really what you're interested in, but I haven't used the coverage tools in a while, I would just caution you to double-check your assumptions about the results.
Thanks for the response. That doesn’t work because I need to only count a gene as covered when the read depth is 10x at any position across that gene.