Question: How can I calculate coverage from the base frequencies file?
0
SaltedPork100 wrote:

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

``````Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0
``````
modified 21 months ago by cpad011212k • written 21 months ago by SaltedPork100

Do you have any lines of all 0 coverage?

Yes there will be. Mostly at the beginning.

2

try sequence logos. Below code is for %:

``````cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov, percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)
``````

output:

``````> df
Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673
``````

Input:

``````> cov
Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0
``````

========================================================

``````library(scales)
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))
``````

output:

``````> df
Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%
``````
1
Devon Ryan92k wrote:
``````awk '{if(NR>1) {tot += 1; if(\$2 + \$3 + \$4 + \$5 > 0) {cov += 1}}END{print cov/tot}' input.txt
``````

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

1
Nicolas Rosewick8.1k wrote:

in R :

``````cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <-
cov %>%
rowwise() %>%
mutate(perc=sum(A,C,G,T)*100/tot)
``````

And to visualize

``````ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")
``````