How can I calculate coverage from the base frequencies file?
3
0
Entering edit mode
6.3 years ago
SaltedPork ▴ 170

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0
coverage basefrequency frequency • 1.6k views
ADD COMMENT
0
Entering edit mode

Do you have any lines of all 0 coverage?

ADD REPLY
0
Entering edit mode

Yes there will be. Mostly at the beginning.

ADD REPLY
2
Entering edit mode
6.3 years ago

try sequence logos. Below code is for %:

cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov[1], percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)

output:

> df
  Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673

Input:

> cov
  Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0

========================================================

library(scales)   
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
 df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))

output:

> df
  Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%
ADD COMMENT
1
Entering edit mode
6.3 years ago
awk '{if(NR>1) {tot += 1; if($2 + $3 + $4 + $5 > 0) {cov += 1}}END{print cov/tot}' input.txt

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

ADD COMMENT
1
Entering edit mode
6.3 years ago

in R :

cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <- 
  cov %>%
  rowwise() %>% 
  mutate(perc=sum(A,C,G,T)*100/tot)

And to visualize

ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")
ADD COMMENT

Login before adding your answer.

Traffic: 1502 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6