Question: RNASeq data normalisation affected by genes with large number of read counts
22 months ago by
poojasethiya80
poojasethiya80 wrote:

I have two technical replicates for my RNA-Seq data. They are only fairly correlated (~0.6). When I compare the differentially expressed genes, genes that are up in one replicate are down in the other, which biases the biological interpretation. I used TPM to get normalised gene expression values in both cases. To look into this in detail, I first calculated TPM over the entire gene list and picked 6 genes (Table1) showing contrasting DE patterns. I then calculated TPM using only these 6 genes (Table2). In Table2 the log2FC values of rep1 and rep2 agree, unlike in Table1. This suggests that some genes affect the calculation when it is done over the entire gene list (Table1). My question is how to overcome the bias that these genes introduce into the whole calculation. Are there any packages available to deal with such data?

Table1 Table2
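
The effect described in the question can be reproduced with toy numbers. The sketch below is only an illustration: the counts, gene lengths and the `tpm` helper are hypothetical and are not taken from Table1 or Table2. Three genes keep identical raw counts in both samples while a fourth, highly expressed gene increases. Because TPM rescales each sample to sum to one million, the three unchanged genes appear to go down; recomputing TPM on those three genes alone removes the apparent change.

```python
import math

def tpm(counts, lengths_kb):
    """Length-normalise counts, then scale so the sample sums to one million."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical toy data: four genes, each 1 kb long.
lengths_kb = [1.0, 1.0, 1.0, 1.0]
control = [100, 100, 100, 700]     # last gene is highly expressed
treatment = [100, 100, 100, 2700]  # only the last gene's count changes

tpm_control = tpm(control, lengths_kb)
tpm_treatment = tpm(treatment, lengths_kb)

# log2 fold changes computed over the full gene list.
log2fc_all = [math.log2(t / c) for t, c in zip(tpm_treatment, tpm_control)]

# Same calculation restricted to the three unchanged genes.
sub_control = tpm(control[:3], lengths_kb[:3])
sub_treatment = tpm(treatment[:3], lengths_kb[:3])
log2fc_subset = [math.log2(t / c) for t, c in zip(sub_treatment, sub_control)]

print(log2fc_all)     # unchanged genes show a negative log2FC
print(log2fc_subset)  # the same genes show a log2FC of zero
```

The fold changes differ between the two calculations only because the denominator of TPM changes with the set of genes included, not because the three genes themselves changed.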

sequencing rna-seq next-gen R • 718 views
modified 22 months ago by i.sudbery5.0k • written 22 months ago by poojasethiya80
22 months ago by
lessismore640
Mexico
lessismore640 wrote:

If these genes are completely distorting your analysis, I suggest removing them from the GTF of your reference genome, remapping your reads, and seeing whether this improves your analysis.

written 22 months ago by lessismore640

I tried repeating the analysis after removing the genes with very high read counts, but even then I could not completely eliminate the bias. Moreover, removing these genes can affect the biological interpretation. For this reason, I am looking for methods/tools to handle highly expressed genes/outliers.

written 22 months ago by poojasethiya80
22 months ago by
i.sudbery5.0k
Sheffield, UK
i.sudbery5.0k wrote:

Firstly, 0.6 is not a good correlation.

Secondly, think about the normalisation you want:

  • If you are looking at changes between conditions (and it appears you are, because you talk about genes that are up and genes that are down), you need between-sample normalisation, not between-gene normalisation (which is what TPM is). edgeR, DESeq and limma all implement between-sample normalisation on counts (not TPM).

  • If you are looking for between-gene effects (like correlation of expression levels, not correlation of fold changes), then TPM is suitable, but Pearson's correlation coefficient is not. Proportionality has recently been suggested as a better alternative for exactly the reason you outline in your question. See this paper by Lovell et al.
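
To make the idea of between-sample normalisation on counts concrete, here is a minimal sketch of median-of-ratios size factors, the idea behind DESeq's estimator. It is a simplified illustration, not the package's code: the `size_factors` helper and the toy counts are hypothetical. In a real analysis you would use the packages' own functions (for example `estimateSizeFactors` in DESeq2 or `calcNormFactors` in edgeR).

```python
import math
from statistics import median

def size_factors(samples):
    """Median-of-ratios size factors for a list of per-sample count lists."""
    n_genes = len(samples[0])
    # Use only genes with non-zero counts in every sample, so logs are defined.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    # Log geometric mean of each usable gene across samples (a pseudo-reference).
    log_ref = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in usable
    }
    # A sample's factor is the median ratio of its counts to the reference.
    return [
        math.exp(median([math.log(s[g]) - log_ref[g] for g in usable]))
        for s in samples
    ]

# Hypothetical toy data: sample_b is sequenced twice as deeply as sample_a,
# and one gene (the last) is also massively up in sample_b.
sample_a = [100, 200, 300, 400, 1000]
sample_b = [200, 400, 600, 800, 50000]

factor_a, factor_b = size_factors([sample_a, sample_b])
norm_a = [c / factor_a for c in sample_a]
norm_b = [c / factor_b for c in sample_b]

# Scaling by total counts instead would be dominated by the last gene.
total_ratio = sum(sample_b) / sum(sample_a)

print(factor_b / factor_a)  # ratio of size factors
print(total_ratio)          # ratio of library sizes
```

Because the median ignores the single extreme gene, the size factors recover the two-fold depth difference and the first four genes come out equal after normalisation, whereas the ratio of total counts is inflated by the one highly expressed gene.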

written 22 months ago by i.sudbery5.0k

My aim is to find out why there is so much variation between the two replicates. As you correctly pointed out, I am trying to find changes between conditions. I have tried tools like DESeq2, edgeR and DEGSeq, but none of them completely eliminates the bias of the contrasting genes. Would you suggest anything else?

written 22 months ago by poojasethiya80

Take a look at an MDS plot of your data: is one of these replicates massively outlying? In any case, it seems you need a stronger normalisation than the one you are doing. Hand-calculated TPMs generally use the total number of reads in a sample, but no mainstream RNA-seq package uses this. Instead, their normalisation factors are calculated in ways that exclude, or are robust to, the most highly expressed genes.

Take a look at the different normalisation methods available in the various packages (DESeq, limma-voom, edgeR). Generally, the strongest normalisation is quantile normalisation, which forces all samples to have the same distribution. It could just be, however, that one of your replicates isn't very good.
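
As a rough illustration of what forcing all samples to have the same distribution means, here is a minimal sketch of quantile normalisation. The `quantile_normalise` helper and the toy values are hypothetical, and ties are broken by gene order, which is a simplification; in practice you would use an existing implementation such as `normalizeQuantiles` in limma.

```python
def quantile_normalise(samples):
    """Give every sample the same distribution of values.

    samples: list of per-sample value lists, genes in the same order.
    """
    n_samples = len(samples)
    n_genes = len(samples[0])
    # Gene indices of each sample, ordered from lowest to highest value.
    orders = [sorted(range(n_genes), key=lambda g: s[g]) for s in samples]
    # Reference distribution: mean across samples of the values at each rank.
    reference = [
        sum(s[order[rank]] for s, order in zip(samples, orders)) / n_samples
        for rank in range(n_genes)
    ]
    # Each gene receives the reference value for its rank within its sample.
    normalised = []
    for order in orders:
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]
        normalised.append(out)
    return normalised

# Hypothetical toy data: the last gene is an extreme value in sample_b.
sample_a = [5, 2, 3, 40]
sample_b = [10, 4, 6, 900]

norm_a, norm_b = quantile_normalise([sample_a, sample_b])
print(norm_a)
print(norm_b)
```

After normalisation both samples contain exactly the same set of values, and only the ranking of genes within each sample is preserved. That is why this is such a strong normalisation: any genuine global difference between samples is removed along with the technical one.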

written 22 months ago by i.sudbery5.0k