Question: Normalization of allele specific counts in RNA-seq
gravatar for Ankit
7 days ago by
Ankit110 wrote:

Hi everyone,

I am performing an allele-specific RNA-seq analysis. Basically, I used SNP positions between two alleles of a mouse hybrid strain to separate reads specific to each allele. Now I normalized the data, as usual, using Deseq2 function and performed differential expression analysis.

However, I wonder if the normalization method for allele-specific counts will be the same or different from regular RNA-seq. If I consider only sequencing depth normalization (CPM: counts per million), the regular RNA-seq data will be normalized by total mapped reads and multiplied by 1M.

But now since I already separated my reads specific to two different alleles, does the same normalization method will make sense? In this case, instead of the total mapped reads, I used total reads assigned to each allele and normalized following counts per million (CPM) method.

I read a few things in this post about a similar topic but still not clear to me how should I proceed?

Please suggest.


ADD COMMENTlink modified 1 day ago • written 7 days ago by Ankit110

Be careful about nomenclature. CPM does not necessarily mean that libraries are solely scaled by library size. edgeR uses the term for its default normalization in which the library size is further corrected by a correction factor that accounts for library composition changes. The method is called TMM and implemented e.g. in edgeR::cpm()

ADD REPLYlink written 5 days ago by ATpoint21k

I don't think that's right. I've got files made with EdgeR's cpm, and the output is the # counts / # of total assigned counts. No fancier math than that. The documentation for EdgeR's cpm doesn't mention TMM either. If you make a sample, as in the documentation for the function, it's just the simple division.

ADD REPLYlink written 1 day ago by swbarnes26.2k

By default library size is the colSum, you are right on that one. Still, if the normalization factors are present in the DGElist object they will be used to multiply the colSums with to correct for composition changes between samples.

This is the cpm source code from lines 4 to 12:

cpm.DGEList <- function(y, normalized.lib.sizes=TRUE, log=FALSE, prior.count=2, ...)
#   Counts per million for a DGEList
#   Davis McCarthy and Gordon Smyth.
#   Created 20 June 2011. Last modified 26 Oct 2018.
    lib.size <- y$samples$lib.size
    if(normalized.lib.sizes) lib.size <- lib.size*y$samples$norm.factors

The same would hold true for DESeq2::fp(k)m, where you can choose between colSums or size factors for correction.

ADD REPLYlink modified 1 day ago • written 1 day ago by ATpoint21k
gravatar for kristoffer.vittingseerup
5 days ago by
European Union
kristoffer.vittingseerup2.2k wrote:

Yes it does. The CPM normalization is nessesary since your library sizes are different. If you normalize each allele individually you will impose an assumption that the allele expression from each parent is similar - which I would think is a bit of a stretch due to amongst other x-inactivation.

For the allele specific differential expression are you running DESeq2 one or two times?

ADD COMMENTlink written 5 days ago by kristoffer.vittingseerup2.2k


Thanks for the reply.

So once after separating the allele-specific counts, I normalized these counts either by CPM or Deseq2 function.

For now, let's focus on Deseq2, a single Deseq2 function [DESeq()] can both normalize and call differential expression. So, in this case, I have just given low-counts-filtered RAW counts to Deseq() function. The matrix of counts, which I have contains allele1 and allele2 in a separate column and there are 2 replicates in this way. So matrix column header example looks like this: GenesID Allele1_rep1 Allele2_rep1 Allele1_rep2 Allele2_rep2

I run Deseq2 only one time depending on what output I wanted. For only normalized counts I ran estimateSizeFactors() function. For differential expression, I ran Deseq() function.

Now things which created doubt to me in normalization are:

  1. Do I need to normalize the data and if yes how? Because now the total number of reads mapped to each allele is a combination of coverage (sequencing depth) and the difference in expression between two alleles. I am not sure if dividing the assigned counts with total mapped reads (after separating) will be the correct way.

  2. Which normalization method to choose? Will Deseq2 normalization [estimateSizeFactors()] will be helpful? For me, any normalization method (even TMM) is ok which can give me reliable normalized counts.

  3. You suggested X-inactivation will also affect the normalization process. I agree. X chromosome anyway not so important in my allele-specific analysis. If I remove chrX and chrY genes and then normalize by total reads mapped, will it make sense?

I would appreciate any help and suggestions.


ADD REPLYlink modified 4 days ago • written 4 days ago by Ankit110

All the parts of the DESeq function can be called separately, including size normalizing. So you don't have to call it separately if you just want it to do it the default way. Just let the dds <- DESeq(dds) function handle it.

ADD REPLYlink written 1 day ago by swbarnes26.2k

Do you know if Deseq2 can take normalized counts by any chance? For example CPM normalized counts?

ADD REPLYlink written 17 hours ago by Ankit110

Unless you are sure you understand the DESeq algorithm as well as the author, don't try to trick it or change it. It wants raw counts only.

ADD REPLYlink written 12 hours ago by swbarnes26.2k

The way you do it now (two columns per sample) is wrong. DESeq2 will think you have twice the number of samples you have - and they will be highly inter-correlated which will cause a whole lot of problems. You need to reformat the data to have only one column per sample and two rows per gene. This will also solve the normalisation factor problem since only one factor per sample will be calculated. Furthermore it will be better for FDR control. Please also note that problems described above which are not solved just by removing the sex-chromosomes as inactivation happens all over the genome (x-inactivation is just the most prominent example).

ADD REPLYlink written 18 hours ago by kristoffer.vittingseerup2.2k

So I am thinking to normalize by coverage (total reads mapped) of original total mapped reads. I will normalize both the alleles counts by the total reads mapped (before separating/assigning the reads = allele1+allele2+non-SNP reads). By dividing with total mapped reads, the bias of either allele due to coverage difference will be removed. These coverage normalized counts I will use to calculate FDR, z-score etc. For differential expression analysis, I would give this normalized matrix to NOISeq which accepts the normalized counts. I am not sure if Deseq2 will accept the normalized counts

ADD REPLYlink written 17 hours ago by Ankit110
gravatar for Ankit
1 day ago by
Ankit110 wrote:


If anyone has a suggestion please let me know.

I would appreciate any help.


ADD COMMENTlink written 1 day ago by Ankit110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1105 users visited in the last hour