How to get coverage for specific base(s), and for specific mismatch(es) per genomic range
5 weeks ago

Hi,

I would like to get coverage for a set of genomic ranges, with the small complication that I need the coverage over T, G, A, and C reported separately. My idea is to first add feature names to the mpileup file (with bedtools intersect) and then do something akin to R's dplyr::summarize(). But maybe there is a bash alternative?
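
To make the goal concrete, here is a rough awk sketch of the summarizing step. It is only an illustration under my own assumptions: the annotated pileup is tab-separated, the first mpileup columns are unchanged (reference base in column 3, depth in column 4), and the feature name sits in the last column. The function name `sum_cov_by_base` is made up for this example.

```shell
# Sum depth (column 4) per feature and per reference base (column 3).
# Assumption: the feature name is the LAST column of the annotated pileup.
sum_cov_by_base() {
  awk 'BEGIN { FS = OFS = "\t" }
       # key = feature name + reference base (upper-cased so "a" and "A" merge)
       { cov[$NF OFS toupper($3)] += $4 }
       # awk iterates arrays in arbitrary order, hence the sort afterwards
       END { for (k in cov) print k, cov[k] }' "$@" | sort
}
```

Calling `sum_cov_by_base annotated.pileup` (a placeholder file name) should print one line per feature and base: feature, base, summed depth.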

The second step of my struggle is to count specific mismatches based on the mpileup read-base codes. I thought of doing this in R, because I know how, but maybe someone could help me get started with awk. A one-liner that counts the occurrences of "g" in column 5 (see below) and prints that number in that column instead would get me started.

Of course, if there is a more efficient way to accomplish the task, let me know (I am sure there is)!

slam_500_spike  390 A   46  .,..,.,.g..,,g,.,,,,.g.,,,,,,,g,,....,,,,,,,.   FFFF:FFFFFFFFFJJFFFJFJJJJFJJJJFFJJJJJJJFFJJJJJ
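
For the counting part, the closest I can sketch is the following. It assumes a tab-separated mpileup file, and the wrapper name `count_g_in_col5` is just something I made up. It is a naive count: a "g" inside an indel string (e.g. `+2gg`), or one acting as the mapping-quality character after `^`, would be counted too.

```shell
# Replace column 5 (read bases) with the number of lowercase "g" it contains.
count_g_in_col5() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         bases = $5                  # work on a copy of the read-base string
         $5 = gsub(/g/, "", bases)   # gsub returns the number of matches replaced
         print
       }' "$@"
}
```

On the line above, `count_g_in_col5 sample.pileup` (placeholder file name) should put 4 in column 5. Swapping `/g/` for `/[gG]/` would count both strands.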

