Question: How to normalise RNA-sequencing data that only consists of coverage for around 250 genes?
12 months ago, dwooi7417 wrote:

Hi, I am new to Biostars and have run into a problem that I do not have the expertise to solve.

I performed RNA capture sequencing on a set of cancer samples, which means that only a select panel of genes was sequenced. I would have liked to apply common across-sample normalisation methods such as RLE (DESeq) and TMM to this data. However, the dataset covers only 250 genes, chosen for their involvement in cancer, so I worry that it may not satisfy the assumptions that underlie those methods (the main one being that the majority of genes in a sample are not differentially expressed).

I do have ERCCs spiked into the samples. While the intention was to have ERCCs spiked in at relatively similar proportions, some samples ended up with a larger proportion of reads mapping to the ERCCs. Can I still use these ERCCs for normalisation through RUV? Are there any other methods of normalisation? Without a control or any biological replicates how can I check if the normalisation has worked?

12 months ago, Devon Ryan (Freiburg, Germany) wrote:

You don't need to use RUV. Just estimate the size factors from the ERCC spike-ins and apply them to the counts from the cancer panel. In fact, the estimateSizeFactors() function in DESeq2 has a controlGenes parameter meant to do exactly this, with the idea being to remove the spike-ins after size factor estimation.
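
To make the idea concrete, here is a minimal sketch in plain Python of median-of-ratios size factors computed from the spike-in rows only and then applied to the panel genes. It illustrates the approach and is not DESeq2's own code: the function names `size_factors` and `normalise`, the gene names, and the counts are all made up. It also assumes the same amount of spike-in was added to every sample, which is worth checking if the ERCC read proportions differ a lot between samples.

```python
import math
from statistics import median

def size_factors(counts, control_genes):
    """Median-of-ratios size factors estimated from control genes only.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    control_genes: names of the spike-ins used for estimation.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene in control_genes:
        row = counts[gene]
        if min(row) <= 0:
            continue  # geometric mean is undefined if any count is zero
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no control gene has non-zero counts in every sample")
    # One factor per sample: median ratio to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]

def normalise(counts, factors, drop=()):
    """Divide each sample's counts by its size factor, dropping `drop` genes."""
    return {gene: [c / f for c, f in zip(row, factors)]
            for gene, row in counts.items() if gene not in drop}

# Toy data (hypothetical): sample 2 has twice the spike-in reads of sample 1
counts = {
    "ERCC-A": [100, 200],
    "ERCC-B": [50, 100],
    "ERCC-C": [10, 20],
    "GENE1": [30, 120],
    "GENE2": [80, 80],
}
ercc = ["ERCC-A", "ERCC-B", "ERCC-C"]
factors = size_factors(counts, ercc)          # estimated from spike-ins only
panel = normalise(counts, factors, drop=ercc)  # spike-ins removed afterwards
```

In this toy case the second sample's size factor comes out twice the first's, so GENE1 still differs twofold after normalisation while GENE2 is halved in the second sample relative to the first.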

11 months ago, Charles Warden (Duarte, CA) wrote:

This sounds kind of like an nCounter experiment.

I think this may mean you need to do some trial and error with your own data set: while some nCounter-specific methods are provided, re-analysis with more general methods seemed to be important for the data sets that I have seen.

So, if you look into that literature, it may give you some ideas. While you don't exactly have positive and negative control sequences, you could try using ERCC counts and/or counts for highly expressed housekeeping genes (or total or aligned read counts).

If the total reads per sample vary a lot, then you may have additional issues. In any case, these are my thoughts.
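
As one way to start that trial and error, the sketch below computes simple scaling factors from three candidate references (ERCC totals, housekeeping totals, total aligned reads) and reports how far they disagree. Everything here is illustrative: the helper names `total_count_factors` and `max_fold_disagreement` and all of the numbers are made up, and total-count scaling is only one of several possible choices.

```python
import math

def total_count_factors(totals):
    """Scaling factors proportional to per-sample totals, centred so that
    their geometric mean is 1."""
    log_mean = sum(math.log(t) for t in totals) / len(totals)
    return [t / math.exp(log_mean) for t in totals]

def max_fold_disagreement(a, b):
    """Largest per-sample fold difference between two sets of factors."""
    return max(max(x / y, y / x) for x, y in zip(a, b))

# Made-up per-sample totals for three candidate references
ercc_totals = [1000, 2000, 4000]
housekeeping_totals = [5000, 9000, 21000]
aligned_totals = [1_000_000, 1_500_000, 8_000_000]

by_ercc = total_count_factors(ercc_totals)
by_hk = total_count_factors(housekeeping_totals)
by_aligned = total_count_factors(aligned_totals)

# Large disagreement suggests the references do not tell the same story
print("ERCC vs housekeeping:", max_fold_disagreement(by_ercc, by_hk))
print("ERCC vs aligned:", max_fold_disagreement(by_ercc, by_aligned))
```

If the candidate references give similar factors, the choice matters little; if they diverge strongly for particular samples, those samples are the ones to inspect first.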
