I have a question about normalization of gene expression data.
I have RNA-seq data from 4 different cell lines. Each cell line was treated with a mock and drug1. So in total I have 8 RNA-seq samples. Cell lines 3 and 4 have a physiological response to this drug1 and cell ines 1 and 2 do not have any response. So what I am looking for is to identify genes in samples 3 and 4 that may underlie the physiological response to drug1.
This is what I have done.
1. I aligned the reads to the genome (I do not care about splice forms or discovery of new isoforms) and used BWA MEM or it. I also did not remove potential PCR artifacts by picard because the reads were only 50bp and therefore the chances of seeing true copies that start and stop at the same site are not negligible.
2. For each gene's exonic region, I counted the reads that fall into that region and normalized by the length of the exon (to make it reads per bp) and I divided by the total number of reads in those sample.
At this point I am not sure how best to proceed and looking for advice.
One way to proceed is for me to generate a difference in gene expression for each sample (by subtracting the expression value from drug treated sample from the mock treated sample; for example,
d_k = 1 + (e_k_drug1 - e_k_mock)/ e_k_mock
where e_k_mock is the expression level of gene k from the mock sample. By doing the above normalization, the values d_k are converted into a number which can be compared relative to the expression level of gene in the mock sample. So for example, if the genes do not change their expression, then the d_k value is 1; if the genes expression doubled upon drug1 treatment, the d_k value is 2 and so forth. Does this sound like a reasonable way to transform the values?
However, the issue with the above normalization is that the expression values of a given gene between the mock and drug1 treatment may not be compariable. For example, for unknown reasons, the normalized expression of each gene is higher in drug1 treated samples, how do I correct for it? I am wondering if I should have a second level of normalization by making all the gene expression values for each sample into a 0 mean and 1 standard deviation. This way, all the expression values are now relative to the expression within the sample and the distribution of the d_k values will have a 0 mean. Any suggestion would be much appreciated.
Please note, I want to properly understand how the normalization is done and that is why I am not using the existing tools such as cuffdiff etc.