Question

How can I calculate coverage from the base frequencies file?

0

Entering edit mode

6.3 years ago

SaltedPork ▴ 170

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0

coverage basefrequency frequency • 1.6k views

ADD COMMENT • link updated 6.3 years ago by cpad0112 21k • written 6.3 years ago by SaltedPork ▴ 170

0

Entering edit mode

Do you have any lines of all 0 coverage?

ADD REPLY • link 6.3 years ago by Devon Ryan 104k

0

Entering edit mode

Yes there will be. Mostly at the beginning.

ADD REPLY • link 6.3 years ago by SaltedPork ▴ 170

1

Entering edit mode

6.3 years ago

Devon Ryan 104k

awk '{if(NR>1) {tot += 1; if($2 + $3 + $4 + $5 > 0) {cov += 1}}END{print cov/tot}' input.txt

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

ADD COMMENT • link 6.3 years ago by Devon Ryan 104k

1

Entering edit mode

6.3 years ago

Nicolas Rosewick 10k

in R :

cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <- 
  cov %>%
  rowwise() %>% 
  mutate(perc=sum(A,C,G,T)*100/tot)

And to visualize

ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")

ADD COMMENT • link 6.3 years ago by Nicolas Rosewick 10k

score 2 · Accepted Answer · 2017-12-15

try sequence logos. Below code is for %:

cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov[1], percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)

output:

> df
  Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673

Input:

> cov
  Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0

========================================================

library(scales)   
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
 df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))

output:

> df
  Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%