Question

How to do coverage counting for each genomic position in R ?

3

Entering edit mode

10.1 years ago

MAPK ★ 2.1k

Hi Guys,

I am using pileup in Rsamtool from R package. I need to get the total "count" for each position. For example, for chr11, position 1643082, there are a total of 1 count of A, 3 counts of C and 18 counts of T. I want them in this order:

chr11:1643082  A:1, C:3, T:18

and so forth for position chr11, 47663948 to be shown in the result. There are thousands of different positions and I need to calculate the counts for each nucleotide.Thanks!

table1

seqnames       pos strand nucleotide count               which_label
1     chr11   1643082      +          A     1     chr11:1643082-1643082
2     chr11   1643082      -          C     3     chr11:1643082-1643082
3     chr11   1643082      +          T    15     chr11:1643082-1643082
4     chr11   1643082      -          T     3     chr11:1643082-1643082
5     chr11  47663948      +          C    16   chr11:47663948-47663948
6     chr11  47663948      -          C    11   chr11:47663948-47663948
7     chr11  47663948      +          T     2   chr11:47663948-47663948

R • 3.1k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by MAPK ★ 2.1k

Ram · Accepted Answer · 2015-06-13

9

Entering edit mode

10.1 years ago

PoGibas 5.1k

You can do this easily with data.table package.

Something like this:

df[, sum(count), by=list(seqnames,pos,nucleotide)]

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by PoGibas 5.1k

0

Entering edit mode

Thank you for your help!

ADD REPLY • link 10.1 years ago by MAPK ★ 2.1k

Ram · Accepted Answer · 2015-06-15

7

Entering edit mode

10.1 years ago

Giovanni M Dall'Olio 28k

The alternative using dplyr:

library(dplyr)
df %>%
   group_by(seqnames,pos,nucleotide) %>%
   summarise(count=sum(count))

Edit: to get the exact output you described, you have to:

> biost %>% 
    group_by(seqnames,pos,nucleotide) %>% 
    summarise(count=sum(count))%>% 
    mutate(mypos=paste(seqnames, pos, sep=':')) %>%
    group_by(mypos) %>% 
    arrange(nucleotide) %>%   # Make sure nucleotides are always sorted the same way
    summarise(counts=paste(nucleotide, count, sep=':',collapse=', '))

Source: local data frame [2 x 2]

           mypos         counts
1  chr11:1643082 A:1, C:3, T:18
2 chr11:47663948      C:27, T:2

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Thank you, Giovanni!

ADD REPLY • link 10.1 years ago by MAPK ★ 2.1k