Normalization of allele specific counts in RNA-seq
3
0
Entering edit mode
2.3 years ago
Ankit ▴ 320

Hi everyone,

I am performing an allele-specific RNA-seq analysis. Basically, I used SNP positions between two alleles of a mouse hybrid strain to separate reads specific to each allele. Now I normalized the data, as usual, using Deseq2 function and performed differential expression analysis.

However, I wonder if the normalization method for allele-specific counts will be the same or different from regular RNA-seq. If I consider only sequencing depth normalization (CPM: counts per million), the regular RNA-seq data will be normalized by total mapped reads and multiplied by 1M.

But now since I already separated my reads specific to two different alleles, does the same normalization method will make sense? In this case, instead of the total mapped reads, I used total reads assigned to each allele and normalized following counts per million (CPM) method.

I read a few things in this post about a similar topic but still not clear to me how should I proceed?

Thanks

RNA-Seq allele-specific normalization SNP next-gen • 1.7k views
1
Entering edit mode

Be careful about nomenclature. CPM does not necessarily mean that libraries are solely scaled by library size. edgeR uses the term for its default normalization in which the library size is further corrected by a correction factor that accounts for library composition changes. The method is called TMM and implemented e.g. in edgeR::cpm()

1
Entering edit mode

I don't think that's right. I've got files made with EdgeR's cpm, and the output is the # counts / # of total assigned counts. No fancier math than that. The documentation for EdgeR's cpm doesn't mention TMM either. If you make a sample, as in the documentation for the function, it's just the simple division.

2
Entering edit mode

By default library size is the colSum, you are right on that one. Still, if the normalization factors are present in the DGElist object they will be used to multiply the colSums with to correct for composition changes between samples.

This is the cpm source code from lines 4 to 12:

cpm.DGEList <- function(y, normalized.lib.sizes=TRUE, log=FALSE, prior.count=2, ...)
#   Counts per million for a DGEList
#   Davis McCarthy and Gordon Smyth.
{
lib.size <- y$samples$lib.size
if(normalized.lib.sizes) lib.size <- lib.size*y$samples$norm.factors
cpm.default(y\$counts,lib.size=lib.size,log=log,prior.count=prior.count)
}


The same would hold true for DESeq2::fp(k)m, where you can choose between colSums or size factors for correction.

0
Entering edit mode

Hi,

If anyone has a suggestion please let me know.

I would appreciate any help.

Thanks

0
Entering edit mode

Hi,

I would appreciate any suggestion.

Thanks

3
Entering edit mode
2.3 years ago

Michael Love (the developer of DESeq2) has written a nice manual page on how to run DESeq2 for allele-specific gene expression analysis - http://rstudio-pubs-static.s3.amazonaws.com/275642_e9d578fe1f7a404aad0553f52236c0a4.html.

0
Entering edit mode

0
Entering edit mode

Thank you for sharing this link. The manual is well explained. What I understood is that author did not use size factor for normalization rather set the size factor to unity and calculated alternate vs reference allele ratios.

There are few things, which I didn't get:

1. In case you set the size factor to unity, the counts will remain same. will they be normalized? how to extract normalized counts?

2. The alt / ref ratio will find differentially expressed genes between control and treated interaction. But how will I find combination differences between allele1 and allele2. So for example

control-allele1 vs control-allele2

control-allele1 vs treated-allele1

control-allele2 vs treated-allele2

treated-allele1 vs treated-allele2

In the manual resultsNames(dds) information are as follows: [1] "Intercept" "condition_treated_vs_control" [3] "conditioncontrol.sample2" "conditiontreated.sample2"
[5] "conditioncontrol.sample3" "conditiontreated.sample3"
[7] "conditioncontrol.sample4" "conditiontreated.sample4"
[9] "conditioncontrol.sample5" "conditiontreated.sample5"
[11] "conditioncontrol.countalt" "conditiontreated.countalt"

conditioncontrol.countalt gives the alt / ref ratio in control and conditiontreated.countalt will gives the alt vs ref ratio in treated samples. So, might be DE genes can be obtained for combinations,

conditioncontrol.countalt gives the alt / ref ratio = control-allele1 vs control-allele2

conditiontreated.countalt gives the alt / ref ratio = treated-allele1 vs treated-allele2

There is one command to call out interaction between control and treated sample:

res.diff <- results(dds, contrast=list("conditiontreated.countalt", "conditioncontrol.countalt"))

But I don't how to extract for

control-allele1 vs treated-allele1

control-allele2 vs treated-allele2

0
Entering edit mode

Hi Ankit,

To be on a safe side, you better post your question in Bioconuctor forum. The developer of DESeq2 is very responsive and helpful.

0
Entering edit mode

Hi, Thank you for the suggestion. So I posted here https://support.bioconductor.org/p/124191/

1
Entering edit mode
2.3 years ago

Yes it does. The CPM normalization is nessesary since your library sizes are different. If you normalize each allele individually you will impose an assumption that the allele expression from each parent is similar - which I would think is a bit of a stretch due to amongst other x-inactivation.

For the allele specific differential expression are you running DESeq2 one or two times?

0
Entering edit mode

Hi,

So once after separating the allele-specific counts, I normalized these counts either by CPM or Deseq2 function.

For now, let's focus on Deseq2, a single Deseq2 function [DESeq()] can both normalize and call differential expression. So, in this case, I have just given low-counts-filtered RAW counts to Deseq() function. The matrix of counts, which I have contains allele1 and allele2 in a separate column and there are 2 replicates in this way. So matrix column header example looks like this: GenesID Allele1_rep1 Allele2_rep1 Allele1_rep2 Allele2_rep2

I run Deseq2 only one time depending on what output I wanted. For only normalized counts I ran estimateSizeFactors() function. For differential expression, I ran Deseq() function.

Now things which created doubt to me in normalization are:

1. Do I need to normalize the data and if yes how? Because now the total number of reads mapped to each allele is a combination of coverage (sequencing depth) and the difference in expression between two alleles. I am not sure if dividing the assigned counts with total mapped reads (after separating) will be the correct way.

2. Which normalization method to choose? Will Deseq2 normalization [estimateSizeFactors()] will be helpful? For me, any normalization method (even TMM) is ok which can give me reliable normalized counts.

3. You suggested X-inactivation will also affect the normalization process. I agree. X chromosome anyway not so important in my allele-specific analysis. If I remove chrX and chrY genes and then normalize by total reads mapped, will it make sense?

I would appreciate any help and suggestions.

Thanks

1
Entering edit mode

All the parts of the DESeq function can be called separately, including size normalizing. So you don't have to call it separately if you just want it to do it the default way. Just let the dds <- DESeq(dds) function handle it.

0
Entering edit mode

Do you know if Deseq2 can take normalized counts by any chance? For example CPM normalized counts?

1
Entering edit mode

Unless you are sure you understand the DESeq algorithm as well as the author, don't try to trick it or change it. It wants raw counts only.

1
Entering edit mode

Please also note that problems described above which are not solved just by removing the sex-chromosomes as inactivation happens all over the genome (x-inactivation is just the most prominent example). The way you do it now (two columns per sample) is wrong. DESeq2 will think you have twice the number of samples you have - and they will be highly inter-correlated which will cause a whole lot of problems. You need to reformat the data to have only one column per sample and two rows per gene. This will also solve the normalisation factor problem since only one factor per sample will be calculated. Furthermore it will be better for FDR control.

0
Entering edit mode

So I am thinking to normalize by coverage (total reads mapped) of original total mapped reads. I will normalize both the alleles counts by the total reads mapped (before separating/assigning the reads = allele1+allele2+non-SNP reads). By dividing with total mapped reads, the bias of either allele due to coverage difference will be removed. These coverage normalized counts I will use to calculate FDR, z-score etc. For differential expression analysis, I would give this normalized matrix to NOISeq which accepts the normalized counts. I am not sure if Deseq2 will accept the normalized counts

0
Entering edit mode

I just re-read the original question and I'm sorry I misunderstood the objective so please ignore most of my last comment and see my new. For your last comment unless you have a very good reason (which allele specific data is not) I would always go with one of the standard tools instead of something ad hoc. And when you have raw counts you should always use those as input to the DE tools.

1
Entering edit mode

The main problem is that the paired samples are not independent meaning they will be more similar than you would expect. One way of handling this is by using a paired design where you simply adding an additional parameter in your model for the different individuals. You can see an example in the DESeq2 vignette here. An additional problem is that the size factors estimated (by the estimateSizeFactors() function) should not be (to) different (since then it might normalize away some of the things you are intesrested in). I would look closely at the size factors estimated and most likely manually set them to the same factor (estimated from the non-allele-devided count matrix) using sizeFactors(object) <- value.

0
Entering edit mode

Thanks for the suggestions.

There are three things which I could think of now:

1. Set the size factor manually before running Deseq(). Which size factor to set? the one obtained from matrix before splitting? How to calculate that? If I set size factor will Deseq2 not calculate its own size factor while running its function Deseq(). Then how will I call differential expression using Deseq2?

2. Not to call differential expression and only consider calculating z-score and fisher test p-value to identify the differences between allelic expression. In this case, I was thinking to normalize both the alleles by library size of originally mapped reads (Before splitting). The only method which can give me coverage normalized counts, in this case, is CPM (because of the limitation of using any tools). CPM will coverage normalized my matrix.

3. If there is a tool exist which can take coverage-normalized data, I can call differential expression too. But is there any tool exist?

Thank you

0
Entering edit mode
2.3 years ago
Ankit ▴ 320

Hi,

I came across this tool called ISoLDE (https://www.bioconductor.org/packages/release/bioc/html/ISoLDE.html) for allele specific analysis. In the example data sets they have given both raw counts and normalized counts matrix. I wanted to know if there is similarity between factor used to normalize both the alleles. So I divided the raw counts by normalized counts and found that both the alleles were normalized by same size factor.

This R-package has used RLE method for normalization.

But I dont know how they calculated the size-factor. If anyone can help me how RLE method can be used for normalization of allele-specific counts, please let me know.

I would appreciate any valuable suggestions.

Thanks

0
Entering edit mode

Hello @Ankit can you explain to me how you did the normalization I am facing the same problem right now? right now I have the allelic count matrix. can please share your suggestions, It will be very grateful( My mail ID hemantcnaik@gmail.com)

1
Entering edit mode

Hi,

You can set this in Deseq2.

Calculate size factor from before split matrix :Bulk.

Use this size factor and set for after allele split matrix.

Suppose from Bulk sample 1 has size factoe 0.6. Then after allele specific splitting, use 0.6 for both sample1-allele1 and sample1-allele2

Then normalize as usual. Deseq2 will use user supplied size factor for normalisation.

0
Entering edit mode

@Ankit thank you for your valuable responses, can you please guide me through code. It will be very helpful