Question

How to normalise RNA-sequencing data that only consists of coverage for around 250 genes?

4.6 years ago

dwooi7417 • 0

Hi, I am new to Biostars, and I have run into a problem that I do not have the expertise to solve.

I performed RNA capture sequencing on a set of cancer samples, which means only a select panel of genes was sequenced. I would have liked to apply common across-sample normalisation methods such as RLE (DESeq) and TMM. However, the dataset covers only about 250 genes, chosen for their involvement in cancer, so I worry that it may not satisfy the assumptions that underlie those methods (the main one being that the majority of genes in a sample are not differentially expressed).

I do have ERCCs spiked into the samples. While the intention was to have ERCCs spiked in at relatively similar proportions, some samples ended up with a larger proportion of reads mapping to the ERCCs. Can I still use these ERCCs for normalisation through RUV? Are there any other methods of normalisation? Without a control or any biological replicates how can I check if the normalisation has worked?

RNA-Seq sequencing R normalisation • 1.0k views

updated 4.6 years ago by Charles Warden 8.2k • written 4.6 years ago by dwooi7417 • 0

score 0 · Answer 1 · 2019-09-27

You don't need to use RUV; just estimate the size factors from the ERCC spike-ins and apply them to the counts from the cancer panel. In fact, the estimateSizeFactors() function in DESeq2 has a controlGenes parameter meant to do exactly this, with the idea being to remove the spike-ins after size factor estimation.
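To make the idea concrete, here is a minimal Python sketch of median-of-ratios size factors computed from control genes only and then applied to every gene. It illustrates the concept rather than reproducing DESeq2's code: the function names `size_factors_from_controls` and `normalise` are made up for this example, the counts are toy numbers, and it assumes the same amount of spike-in went into every sample.

```python
import math
from statistics import median


def size_factors_from_controls(counts, control_genes):
    """Median-of-ratios size factors using only the control genes.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    control_genes: names of the spike-ins used as the reference.
    """
    n_samples = len(next(iter(counts.values())))

    # Log geometric mean of each control gene across samples.
    # Controls with a zero count in any sample are skipped.
    log_geo_means = {}
    for gene in control_genes:
        row = counts[gene]
        if all(c > 0 for c in row):
            log_geo_means[gene] = sum(math.log(c) for c in row) / n_samples
    if not log_geo_means:
        raise ValueError("no control gene has non-zero counts in every sample")

    # Size factor per sample: median ratio of its control counts
    # to the per-gene geometric means.
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm
                      for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalise(counts, factors, drop=()):
    """Divide every count by its sample's size factor, dropping the controls."""
    return {g: [c / f for c, f in zip(row, factors)]
            for g, row in counts.items() if g not in drop}


# Toy data (hypothetical counts): three samples, three spike-ins, two panel genes.
counts = {
    "ERCC-00002": [100, 200, 50],
    "ERCC-00004": [10, 20, 5],
    "ERCC-00009": [40, 80, 20],
    "TP53": [30, 90, 15],
    "MYC": [5, 10, 20],
}
spikes = [g for g in counts if g.startswith("ERCC-")]

factors = size_factors_from_controls(counts, spikes)
normalised = normalise(counts, factors, drop=spikes)
print(factors)
print(normalised)
```

If the spike-in amounts actually differed between samples, factors estimated this way would carry that difference into the panel counts, so it is worth comparing them against another reference before trusting them.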

score 0 · Answer 2 · 2019-09-27

This sounds kind of like an nCounter experiment.

I think this may mean you need to do some trial and error with your own data set: while some nCounter-specific methods are provided, re-analysis with more general methods seemed to be important for the data sets that I have seen.

So, if you look into that literature, it may give you some ideas. While you don't exactly have positive and negative control sequences, you could try using ERCC counts and/or counts from highly expressed housekeeping genes (or total / aligned counts).

If the total reads per sample vary a lot, then you may have additional issues. Anyway, these are my thoughts.
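One way to run that kind of trial and error is to compute a scaling factor per sample from each candidate reference and see how much they disagree. The Python sketch below does this with simple summed counts. The helper `scaling_factors`, the toy counts, and the choice of ACTB and GAPDH as housekeeping genes are assumptions for illustration; your panel may not include them.

```python
def scaling_factors(counts, reference_genes=None):
    """Per-sample scaling factors from the summed counts of a reference set.

    counts: dict mapping gene name -> list of raw counts (one per sample).
    reference_genes: genes to sum; None means use every gene (total counts).
    Factors are scaled so that their mean across samples is 1.
    """
    genes = list(counts) if reference_genes is None else list(reference_genes)
    n_samples = len(next(iter(counts.values())))
    totals = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [t / mean_total for t in totals]


# Toy data (hypothetical counts) for two samples.
counts = {
    "ERCC-00002": [100, 200],
    "ERCC-00004": [50, 100],
    "ACTB": [1000, 1000],
    "GAPDH": [500, 500],
    "TP53": [30, 60],
}

ercc = [g for g in counts if g.startswith("ERCC-")]
housekeeping = ["ACTB", "GAPDH"]  # assumed to be on the panel

by_ercc = scaling_factors(counts, ercc)
by_housekeeping = scaling_factors(counts, housekeeping)
by_total = scaling_factors(counts)

for name, f in [("ERCC", by_ercc),
                ("housekeeping", by_housekeeping),
                ("total", by_total)]:
    print(name, [round(x, 3) for x in f])
```

When the references give similar factors, the choice matters little. When they diverge, as in this toy case, that is a sign that at least one reference is unreliable for your samples.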