Question: How can I calculate coverage from the base frequencies file?
0
gravatar for SaltedPork
6 months ago by
SaltedPork70
SaltedPork70 wrote:

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0
ADD COMMENTlink modified 6 months ago by cpad01126.3k • written 6 months ago by SaltedPork70

Do you have any lines of all 0 coverage?

ADD REPLYlink written 6 months ago by Devon Ryan81k

Yes there will be. Mostly at the beginning.

ADD REPLYlink written 6 months ago by SaltedPork70
2
gravatar for cpad0112
6 months ago by
cpad01126.3k
cpad01126.3k wrote:

try sequence logos. Below code is for %:

cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov[1], percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)

output:

> df
  Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673

Input:

> cov
  Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0

========================================================

library(scales)   
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
 df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))

output:

> df
  Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%
ADD COMMENTlink modified 6 months ago • written 6 months ago by cpad01126.3k
1
gravatar for Devon Ryan
6 months ago by
Devon Ryan81k
Freiburg, Germany
Devon Ryan81k wrote:
awk '{if(NR>1) {tot += 1; if($2 + $3 + $4 + $5 > 0) {cov += 1}}END{print cov/tot}' input.txt

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

ADD COMMENTlink written 6 months ago by Devon Ryan81k
1
gravatar for Nicolas Rosewick
6 months ago by
Belgium, Brussels
Nicolas Rosewick6.3k wrote:

in R :

cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <- 
  cov %>%
  rowwise() %>% 
  mutate(perc=sum(A,C,G,T)*100/tot)

And to visualize

ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")
ADD COMMENTlink written 6 months ago by Nicolas Rosewick6.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 599 users visited in the last hour