Hi, I am sorry for asking this silly question, but I am really confused about the normalization of this data set. I have three FASTQ files of RNA-seq data: sampleA, sampleB, and sampleC. Suppose the total reads are sampleA = 5 million, sampleB = 7 million, and sampleC = 8 million. I have counted the nucleotide frequencies in the sequences that are 18, 19, 20, and 21 bases long in each FASTQ file. I want to plot the frequency of A, C, G, and T in these sequences, and before plotting I need to normalize the frequency matrix.
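In case it helps, this is roughly how the count matrices were built; a minimal plain-Python sketch, where "sampleA.fastq" is a placeholder filename and standard 4-line FASTQ records are assumed:

    # Sketch: tally A/C/G/T per read length (18-21 nt) from a FASTQ file.
    # "sampleA.fastq" is a placeholder; assumes standard 4-line FASTQ records.
    from collections import defaultdict

    counts = {n: defaultdict(int) for n in (18, 19, 20, 21)}

    with open("sampleA.fastq") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:                    # second line of each record = sequence
                seq = line.strip().upper()
                if len(seq) in counts:
                    for base in seq:
                        counts[len(seq)][base] += 1

    for length in sorted(counts):
        row = counts[length]
        print(length, row["A"], row["C"], row["G"], row["T"])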
Sample A

    length   A        C        G        T
    18       123344   922299   255253   832388
    19       642245   454252   7424534  323444
    20       133455   545543   543344   93322
    21       153335   115543   1633345  213333

Sample B

    length   A        C        G        T
    18       123344   93399    235553   83382
    19       644225   245452   7442534  3311444
    20       1133455  2335543  225344   22322
    21       112335   112243   1622245  213223

Sample C

    length   A       C       G       T
    18       122222  22219   233553  343388
    19       6445    22452   722534  444212
    20       33355   545543  543344  93322
    21       22235   225543  223345  223333
So, in order to normalize, do I add up the total reads of all three samples (i.e. 5 + 7 + 8 = 20 million reads) and divide the A, C, G, T counts of every sample by that pooled total? Or do I divide each sample by its own total reads (for example, divide the A, C, G, T columns of sampleA by 5 million)? How do I get a proportional estimate of the nucleotide frequency in each sample? Thank you for your help.
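Just to make the two options concrete, here is a small NumPy sketch using the sampleA counts from the table above and the read totals mentioned earlier (which option is appropriate is exactly my question):

    # The two candidate normalizations for sampleA. Counts are copied from the
    # table above; 20e6 is the pooled total (5+7+8 million), 5e6 is sampleA's own.
    import numpy as np

    sampleA = np.array([
        [123344, 922299,  255253, 832388],   # length 18: A, C, G, T
        [642245, 454252, 7424534, 323444],   # length 19
        [133455, 545543,  543344,  93322],   # length 20
        [153335, 115543, 1633345, 213333],   # length 21
    ], dtype=float)

    pooled     = sampleA / 20e6   # option 1: divide by the pooled total of all samples
    per_sample = sampleA / 5e6    # option 2: divide by this sample's own total reads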

Hi, I suppose the reads come from a reference genome/transcriptome. In that case, normalizing by mapping percentage might be one way.
So instead of raw counts of A/C/G/T, you could report the proportion of A/C/G/T in mapped reads vs. unmapped reads. If the base composition is changing, looking at ratios rather than counts makes libraries of different sizes comparable.
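Something like the following sketch; NumPy assumed, and the mapped/unmapped count vectors are made up for illustration:

    # Sketch: compare base composition of mapped vs. unmapped reads as proportions,
    # which is independent of library size. The count vectors here are hypothetical.
    import numpy as np

    mapped   = np.array([123344, 922299, 255253, 832388], dtype=float)  # A, C, G, T
    unmapped = np.array([ 20111,  80456,  30999,  71234], dtype=float)

    mapped_prop   = mapped / mapped.sum()       # proportion of each base in mapped reads
    unmapped_prop = unmapped / unmapped.sum()   # proportion in unmapped reads

    ratio = mapped_prop / unmapped_prop         # >1 means enriched among mapped reads
    for base, r in zip("ACGT", ratio):
        print(base, round(r, 3))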