Question

RNASeq data normalisation affected by genes with large number of read counts

7.9 years ago

poojasethiya ▴ 120

I have two technical replicates of an RNA-Seq experiment. They are only fairly correlated (~0.6). When I compare the differentially expressed genes, genes that are up in one replicate are down in the other, which biases the biological interpretation. I used TPM to get normalised gene expression values in both cases. To check in detail, I first calculated TPM for the entire gene list and picked 6 genes (Table1) showing contrasting DE patterns. I then calculated TPM using only these 6 genes (Table2). In Table2 the log2FC values of rep1 and rep2 agree, unlike in Table1. This suggests that some genes affect the calculation when the entire gene list is used (Table1). How can I overcome the bias that these genes introduce into the overall calculation? Are there any packages available to deal with such data?

Table1 Table2
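The effect described above can be reproduced with a small sketch. The counts and gene lengths below are invented for illustration (they are not the values from Table1 or Table2), and `tpm` is a hand-written helper, not a function from any package. It shows that TPM is a proportion of the whole library, so the same gene can appear down when all genes are in the denominator and up when only a subset is used.

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Hand-calculated TPM for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float)[:, None] / 1000.0
    rate = counts / lengths_kb                 # reads per kilobase
    return rate / rate.sum(axis=0) * 1e6       # scale each sample to one million

# Hypothetical toy data: columns are two samples, rows are genes A-D.
lengths = np.array([1000, 2000, 1500, 1000])
counts = np.array([
    [100, 150],     # gene A: modestly higher in sample 2
    [200, 200],     # gene B: unchanged
    [300, 300],     # gene C: unchanged
    [1000, 9000],   # gene D: very highly expressed and strongly changing
])

# TPM over all genes: gene D dominates the denominator of sample 2.
full = tpm(counts, lengths)
log2fc_full = np.log2(full[:, 1] / full[:, 0])

# TPM over genes A-C only: gene D no longer contributes to the denominator.
subset = tpm(counts[:3], lengths[:3])
log2fc_subset = np.log2(subset[:, 1] / subset[:, 0])

print("log2FC, all genes:  ", np.round(log2fc_full, 2))
print("log2FC, subset only:", np.round(log2fc_subset, 2))
```

In this toy case gene A has a negative log2FC on the full gene list and a positive one on the subset, purely because the denominator changed.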

RNA-Seq R rna-seq next-gen sequencing • 2.2k views

ADD COMMENT • link updated 7.9 years ago by i.sudbery 21k • written 7.9 years ago by poojasethiya ▴ 120

score 0 · Answer 1 · 2017-09-04

7.9 years ago

lessismore ★ 1.4k

If these genes are completely distorting your analysis, I suggest removing them from the GTF of your reference genome, remapping your reads, and seeing whether this improves your analysis.

ADD COMMENT • link 7.9 years ago by lessismore ★ 1.4k

I tried repeating the analysis after removing genes with very high read counts, but even then I could not completely eliminate the bias. Moreover, removing these genes can affect the biological interpretation. I am therefore looking for methods/tools that handle highly expressed genes/outliers.

ADD REPLY • link 7.9 years ago by poojasethiya ▴ 120

score 0 · Answer 2 · 2017-09-04

7.9 years ago

i.sudbery 21k

Firstly, 0.6 is not a good correlation.

Secondly, think about the normalisation you want:

If you are looking at changes between conditions (and it appears you are, because you are talking about genes which are up and genes which are down), you need between-sample normalisation, not between-gene normalisation (which is what TPM is). edgeR, DESeq and limma all implement between-sample normalisation on counts (not TPM).
If you are looking for between-gene effects (like correlation of expression levels, not correlation of fold changes), then TPM is suitable, but Pearson's correlation coefficient is not. Proportionality has recently been suggested as a better alternative for exactly the reason you outline in your question. See this paper by Lovell et al.
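As a rough illustration of between-sample normalisation on counts, here is a minimal sketch of median-of-ratios size factors, the idea behind DESeq-style normalisation. It is a simplified re-implementation for explanation only, not the package code; the function name and the toy counts are assumptions made up for this example.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Return one size factor per sample for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep genes with non-zero counts in every sample (log is undefined at zero).
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene reference: log of the geometric mean across samples.
    log_reference = log_counts.mean(axis=1, keepdims=True)
    # Size factor: median over genes of each sample's ratio to the reference.
    return np.exp(np.median(log_counts - log_reference, axis=0))

# Hypothetical toy counts: four stable genes and one dominant, changing gene.
counts = np.array([
    [100, 100],
    [200, 200],
    [300, 300],
    [400, 400],
    [1000, 9000],
])

# Scaling by total reads: the dominant gene inflates the total of sample 2,
# so the four stable genes look strongly down-regulated.
per_million = counts / counts.sum(axis=0) * 1e6
log2fc_total = np.log2(per_million[:, 1] / per_million[:, 0])

# Median-of-ratios: the median ignores the single extreme gene.
size_factors = median_of_ratios_size_factors(counts)
normalised = counts / size_factors
log2fc_mor = np.log2(normalised[:, 1] / normalised[:, 0])

print("size factors:", size_factors)
print("log2FC, total-count scaling:", np.round(log2fc_total, 2))
print("log2FC, median-of-ratios:   ", np.round(log2fc_mor, 2))
```

Because the size factor is a median across genes, a handful of very highly expressed genes cannot shift it, which is why the stable genes keep a log2FC of zero here. For real data, the packages' own functions should be used on raw counts.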

ADD COMMENT • link 7.9 years ago by i.sudbery 21k

My aim is to find out why there is such variation between the two replicates. As you correctly pointed out, I am trying to find changes between conditions. I have tried tools like DESeq2, edgeR and DEGSeq, but none of them completely eliminates the bias of contrasting genes. Would you suggest anything else?

ADD REPLY • link 7.9 years ago by poojasethiya ▴ 120

Take a look at an MDS plot of your data: is one of these replicates massively outlying? In any case, it seems you need a stronger normalisation than the one you are using. Hand-calculated TPMs generally use the total number of reads in a sample, but no mainstream RNA-seq package does this; in fact, they exclude the most highly expressed genes from the calculation of normalisation factors.

Take a look at the different normalisation methods available in the various packages (DESeq, limma-voom, edgeR). Generally, the strongest normalisation is quantile normalisation, which forces all samples to have the same distribution. It could just be, however, that one of your replicates isn't very good.
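To make "forces all samples to have the same distribution" concrete, here is a minimal sketch of quantile normalisation in NumPy. It is a simplified illustration, not the implementation used by any of the packages: ties are broken by row order instead of being averaged, and the function name and toy matrix are made up for the example.

```python
import numpy as np

def quantile_normalise(values):
    """Quantile-normalise a genes x samples matrix (columns are samples)."""
    values = np.asarray(values, dtype=float)
    # Rank of every value within its own sample (0 = smallest).
    order = np.argsort(values, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted samples at each rank.
    reference = np.sort(values, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]

# Hypothetical expression values for 4 genes in 3 samples.
expression = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
])

normalised = quantile_normalise(expression)
print(normalised)
# After normalisation every sample contains the same set of values.
print(np.sort(normalised, axis=0))
```

The price of this strength is that genuine global differences between samples are removed along with the technical ones, so it is worth checking on an MDS plot whether a single poor replicate is the real problem before applying it.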

ADD REPLY • link 7.9 years ago by i.sudbery 21k