Hi, I am sorry for asking this silly question, but I am really confused about the normalization of this data set. I have three FASTQ files of RNA-seq data: sampleA, sampleB, and sampleC. Suppose the total reads are sampleA = 5 million, sampleB = 7 million, and sampleC = 8 million. I have counted the nucleotide frequencies in the sequences that are 18, 19, 20, and 21 bases long in each FASTQ file. I want to plot the frequency of A, C, G, and T in these sequences, and before plotting I need to normalize the frequency matrix.
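In case it helps, this is roughly how the count matrices were built; a minimal plain-Python sketch, where "sampleA.fastq" is a placeholder filename and standard 4-line FASTQ records are assumed:

    # Sketch: tally A/C/G/T per read length (18-21 nt) from a FASTQ file.
    # "sampleA.fastq" is a placeholder; assumes standard 4-line FASTQ records.
    from collections import defaultdict

    counts = {n: defaultdict(int) for n in (18, 19, 20, 21)}

    with open("sampleA.fastq") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                    # second line of each record = sequence
                seq = line.strip().upper()
                if len(seq) in counts:
                    for base in seq:
                        counts[len(seq)][base] += 1

    for length in sorted(counts):
        row = counts[length]
        print(length, row["A"], row["C"], row["G"], row["T"])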
Sample A

    length   A        C        G        T
    18       123344   922299   255253   832388
    19       642245   454252   7424534  323444
    20       133455   545543   543344   93322
    21       153335   115543   1633345  213333

Sample B

    length   A        C        G        T
    18       123344   93399    235553   83382
    19       644225   245452   7442534  3311444
    20       1133455  2335543  225344   22322
    21       112335   112243   1622245  213223

Sample C

    length   A       C       G       T
    18       122222  22219   233553  343388
    19       6445    22452   722534  444212
    20       33355   545543  543344  93322
    21       22235   225543  223345  223333
So, in order to normalize, do I add up the total reads of all three samples (i.e. 5 + 7 + 8 = 20 million reads) and divide the A, C, G, T counts of every sample by that pooled total? Or do I divide each sample by its own total reads (for example, divide the A, C, G, T columns of sampleA by 5 million)? How do I get a proportional estimate of the nucleotide frequency in each sample? Thank you for your help.
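Just to make the two options concrete, here is a small NumPy sketch using the sampleA counts from the table above and the read totals mentioned earlier (which option is appropriate is exactly my question):

    # The two candidate normalizations for sampleA. Counts are copied from the
    # table above; 20e6 is the pooled total (5+7+8 million), 5e6 is sampleA's own.
    import numpy as np

    sampleA = np.array([
        [123344, 922299,  255253, 832388],   # length 18: A, C, G, T
        [642245, 454252, 7424534, 323444],   # length 19
        [133455, 545543,  543344,  93322],   # length 20
        [153335, 115543, 1633345, 213333],   # length 21
    ], dtype=float)

    pooled     = sampleA / 20e6   # option 1: divide by the pooled total of all samples
    per_sample = sampleA / 5e6    # option 2: divide by this sample's own total reads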

Hi, I suppose the reads come from a reference genome/transcriptome. In that case, normalizing by mapping percentage might be one way.
So instead of raw counts of A/C/G/T, you could report the proportion of A/C/G/T in mapped reads vs. unmapped reads. If the base composition is changing, looking at ratios rather than counts makes libraries of different sizes comparable.
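Something like the following sketch; NumPy assumed, and the mapped/unmapped count vectors are made up for illustration:

    # Sketch: compare base composition of mapped vs. unmapped reads as proportions,
    # which is independent of library size. The count vectors here are hypothetical.
    import numpy as np

    mapped   = np.array([123344, 922299, 255253, 832388], dtype=float)  # A, C, G, T
    unmapped = np.array([ 20111,  80456,  30999,  71234], dtype=float)

    mapped_prop   = mapped / mapped.sum()       # proportion of each base in mapped reads
    unmapped_prop = unmapped / unmapped.sum()   # proportion in unmapped reads

    ratio = mapped_prop / unmapped_prop         # >1 means enriched among mapped reads
    for base, r in zip("ACGT", ratio):
        print(base, round(r, 3))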