I would like to get coverage for a set of genomic ranges, with the small complication that I need the coverage over T, G, A, and C reported separately. My idea is to first add feature names to the mpileup file (with bedtools intersect) and then do something akin to R's dplyr::summarize(). But maybe there is a bash alternative for that?
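To make the first step concrete, here is a rough sketch of the kind of summarizing I mean, written with awk. It assumes the intersected file is tab-separated, keeps the six mpileup columns, and has the feature name appended as column 7; that layout and the function name `sum_depth_by_feature_base` are my own assumptions for illustration, so the column numbers would need adjusting to the real bedtools output.

```shell
# Hypothetical helper: sum read depth per (feature, reference base).
# Assumed layout (not verified against real bedtools output):
#   columns 1-6 = mpileup fields, column 7 = feature name.
sum_depth_by_feature_base() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             key = $7 OFS toupper($3)   # feature name + reference base
             sum[key] += $4             # column 4 is the read depth
         }
         END { for (k in sum) print k, sum[k] }' "$@" | sort
}
```

It could be called as `sum_depth_by_feature_base intersected.tsv` (a placeholder file name) and prints one line per feature and reference base with the summed depth.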
The second step of my struggle is to count specific mismatches based on the mpileup base codes. I thought of doing this in R, because I know how, but maybe someone could help me get started with awk. A one-liner that counts the occurrences of "g" in column 5 (see the example line below) and prints that number in place of the column would get me started.
Of course, if there is a more efficient way to accomplish the task - let me know (I am sure there is)!
slam_500_spike 390 A 46 .,..,.,.g..,,g,.,,,,.g.,,,,,,,g,,....,,,,,,,. FFFF:FFFFFFFFFJJFFFJFJJJJFJJJJFFJJJJJJJFFJJJJJ
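For the second step, this is a minimal sketch of the sort of awk command I am after, assuming the mpileup file is tab-separated. The function name `count_g_in_col5` is just a placeholder. It is a naive count: any "g" that is part of an indel string (such as `+1g`) or that follows a `^` as a mapping-quality character would be counted too, so it is only a starting point.

```shell
# Hypothetical helper: replace column 5 (read bases) with the number of "g" in it.
# Assumes tab-separated mpileup input; counts every literal "g" in the column.
count_g_in_col5() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             bases = $5                 # work on a copy of the read-bases string
             $5 = gsub(/g/, "", bases)  # gsub returns the number of replacements made
         }
         1' "$@"
}
```

Used as `count_g_in_col5 sample.mpileup` (placeholder file name), the example line above would get 4 in column 5, with the other columns left unchanged.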