Question: How do I normalize this data for plotting?
0
gravatar for MAPK
7 months ago by
MAPK1.3k
United States
MAPK1.3k wrote:

Hi, I am sorry for asking this silly question, but I am really confused with the normalization of this data set. I have three fastq files for RNAseq data. I have sampleA, sampleB and sample C. Suppose, the total reads in sampleA = 5 millions, sampleB = 7 millions and sampleC = 8 millions. Now, I have calculated the nucleotide frequency in sequences with 18, 19, 20 and 21 bases long from each of these fastq file. I want to plot the frequency of these A, T,G,C in these sequences and before I do the plotting, I need to normalize the frequency data matrix.

Sample A
seq A  C  G  T
18  123344  922299 255253  832388
19  642245  454252  7424534  323444
20  133455  545543  543344 93322
21  153335 115543 1633345 213333


SampleB
    seq A  C  G  T
    18  123344  93399 235553  83382
    19  644225  245452  7442534  3311444
    20  1133455  2335543  225344 22322
    21  112335 112243 1622245 213223

Sample C
    seq A  C  G  T
    18  122222  22219 233553  343388
    19  6445  22452  722534  444212
    20  33355  545543  543344 93322
    21  22235 225543 223345 223333

So in order to normalize, do I just add the total reads in all three (i.e. 5+7+8=20 million reads) and divide all A, T,G,C of each sample ? Or Do I just divide with the total reads of each sample (for example, divide A,T,G,C columns of sample A with 5 million)? How do I get the proportional estimate of nucleotide frequency in each sample? Thank you for your help.

normalization • 251 views
ADD COMMENTlink modified 7 months ago by shenwei3564.1k • written 7 months ago by MAPK1.3k

Hi, I suppose the reads are assumed to be coming from a reference genome/ transcriptome. In that case, normalizing by mapping % might be one way.

So instead of counts of A/C/T/G, you could have proportion of A/C/T/G in mapped reads vs. unmapped reads. So, if the base composition is changing, looking at ratio helps make different sized libraries comparable.

ADD REPLYlink written 7 months ago by Amitm1.6k
2
gravatar for Devon Ryan
7 months ago by
Devon Ryan84k
Freiburg, Germany
Devon Ryan84k wrote:

Divide by the total number of reads in each sample, as is done in FastQC's output. If you need the proportion of bases per sample, then just average the aforementioned values (assuming everything is the same length).

ADD COMMENTlink written 7 months ago by Devon Ryan84k
2
gravatar for shenwei356
7 months ago by
shenwei3564.1k
China
shenwei3564.1k wrote:

Here's a way in R using dplyr and tidyr, only data of one sample is processed.

library(dplyr)
library(tidyr)
library(ggplot2) # for plot
# library(gglogo)  # for plot

df <- read.csv("SampleA.tsv", sep = "\t", header = FALSE)

df <- t(df)              # transpose
colnames(df) <- df[1, ]  # first row as colnames
rownames(df) <- NULL     # remove
df <-df[-1,]             # remove first row
df <-  as.data.frame(df) # to data frame
# seq     18      19     20      21
# 1   A 123344  642245 133455  153335
# 2   C 922299  454252 545543  115543
# 3   G 255253 7424534 543344 1633345
# 4   T 832388  323444  93322  213333

df.n <- df %>% gather(pos, count, -seq)
#    seq pos   count
# 1    A  18  123344
# 2    C  18  922299
# 3    G  18  255253
# 4    T  18  832388
# 5    A  19  642245
# 6    C  19  454252
# 7    G  19 7424534

df.n$count <- as.integer(df.n$count) # convert char to int

df.n <- df.n %>%
  group_by(pos) %>%
  mutate(prop = count / sum(count))
# seq   pos   count       prop
# <fctr> <chr>   <int>      <dbl>
# 1      A    18  123344 0.05781884
# 2      C    18  922299 0.43233765
# 3      G    18  255253 0.11965261
# 4      T    18  832388 0.39019090


p <- ggplot(df.n, aes(x = pos, y = prop, color = seq, fill = seq, label = seq)) +
  geom_bar(stat = "identity", position = "stack")
  # geom_logo(position="classic")
ggsave(p, file = "tmp.jpeg", width = 4, height = 4, dpi = 100)

ADD COMMENTlink modified 7 months ago • written 7 months ago by shenwei3564.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 662 users visited in the last hour