Question: How can I calculate coverage from the base frequencies file?
0
gravatar for SaltedPork
4 weeks ago by
SaltedPork50
SaltedPork50 wrote:

Hi

What I have is a tab delimited file with the frequency of each base at each position in the sequence. How can I take this data and show a coverage % across the entire sequence, just one number.

sample input:

Position    A   C   G   T
1   0   0   1   0   0
2   4   0   0   0   0
3   0   5   0   0   0
4   0   0   0   5   0
5   0   15  0   0   0
6   0   0   108 0   0
7   0   0   147 0   0
ADD COMMENTlink modified 4 weeks ago by cpad01124.1k • written 4 weeks ago by SaltedPork50

Do you have any lines of all 0 coverage?

ADD REPLYlink written 4 weeks ago by Devon Ryan74k

Yes there will be. Mostly at the beginning.

ADD REPLYlink written 4 weeks ago by SaltedPork50
1
gravatar for Devon Ryan
4 weeks ago by
Devon Ryan74k
Freiburg, Germany
Devon Ryan74k wrote:
awk '{if(NR>1) {tot += 1; if($2 + $3 + $4 + $5 > 0) {cov += 1}}END{print cov/tot}' input.txt

In short, count the number of (non-header) lines, noting how many have ACGT counts > 0, then divide those two values.

ADD COMMENTlink written 4 weeks ago by Devon Ryan74k
1
gravatar for Nicolas Rosewick
4 weeks ago by
Belgium, Brussels, Université Libre de Bruxelles / Université de Liège
Nicolas Rosewick6.0k wrote:

in R :

cov <- read_tsv("file.txt",col_names=c("Position","A","C","G","T"))
tot <- cov %>% select(A,C,G,T) %>% sum
cov <- 
  cov %>%
  rowwise() %>% 
  mutate(perc=sum(A,C,G,T)*100/tot)

And to visualize

ggplot(cov,aes(x=Position,y=perc))+geom_bar(stat="identity")
ADD COMMENTlink written 4 weeks ago by Nicolas Rosewick6.0k
1
gravatar for cpad0112
4 weeks ago by
cpad01124.1k
cpad01124.1k wrote:

try sequence logos. Below code is for %:

cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
df=data.frame(cov[1], percentage=(apply(cov[,-1],1,sum)/sum(cov[-1]))*100)

output:

> df
  Position percentage
1        0  0.3558719
2        4  0.0000000
3        0  1.7793594
4        0  1.7793594
5        0  5.3380783
6        0 38.4341637
7        0 52.3131673

Input:

> cov
  Position  A   C G T
1        0  0   1 0 0
2        4  0   0 0 0
3        0  5   0 0 0
4        0  0   0 5 0
5        0 15   0 0 0
6        0  0 108 0 0
7        0  0 147 0 0

========================================================

library(scales)   
cov <-read.csv("test.txt", header = T, sep="\t", stringsAsFactors = F)
 df=data.frame(Position=cov[,1], perc=percent(rowSums(sweep(cov[,-1],1,sum(cov[-1]),"/"))))

output:

> df
  Position  perc
1        0  0.4%
2        4  0.0%
3        0  1.8%
4        0  1.8%
5        0  5.3%
6        0 38.4%
7        0 52.3%
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by cpad01124.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1580 users visited in the last hour