Question: How can I calculate coverage from the base frequencies file?
0
gravatar for SaltedPork
21 months ago by
SaltedPork100
SaltedPork100 wrote:

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0
ADD COMMENTlink modified 21 months ago by cpad011212k • written 21 months ago by SaltedPork100

Do you have any lines of all 0 coverage?

ADD REPLYlink written 21 months ago by Devon Ryan92k

Yes there will be. Mostly at the beginning.

ADD REPLYlink written 21 months ago by SaltedPork100
2
gravatar for cpad0112
21 months ago by
cpad011212k
India
cpad011212k wrote:

try sequence logos. Below code is for %:

cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov[1], percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)

output:

> df
  Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673

Input:

> cov
  Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0

========================================================

library(scales)   
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
 df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))

output:

> df
  Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%
ADD COMMENTlink modified 21 months ago • written 21 months ago by cpad011212k
1
gravatar for Devon Ryan
21 months ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:
awk '{if(NR>1) {tot += 1; if($2 + $3 + $4 + $5 > 0) {cov += 1}}END{print cov/tot}' input.txt

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

ADD COMMENTlink written 21 months ago by Devon Ryan92k
1
gravatar for Nicolas Rosewick
21 months ago by
Belgium, Brussels
Nicolas Rosewick8.1k wrote:

in R :

cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <- 
  cov %>%
  rowwise() %>% 
  mutate(perc=sum(A,C,G,T)*100/tot)

And to visualize

ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")
ADD COMMENTlink written 21 months ago by Nicolas Rosewick8.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 892 users visited in the last hour