Question: Subtracting contamination from RNASeq data
2.4 years ago by
Gareth Palidwor1.6k

I have a set of RNASeq data for various points of a differentiation time series, with replicates. A known fraction of the cells in each sample are MEFs, and that fraction varies quite a bit per sample. I have expression data for a pure MEF sample grown under similar conditions.

I'd like to do a fold change analysis between time points, subtracting the MEF expression contamination in such a way that the resulting increased variance per gene is factored into the fold change analysis.

It seems this may be possible within DESeq2 (for example) or with svaseq, but I can't figure out how. Can anyone recommend a strategy for doing this?

EDITED TO ADD: To clarify, this is a mouse cell line differentiation series that is contaminated by Mouse Embryonic Fibroblasts (MEFs). Because the contaminating MEF mRNA comes from the same species as the differentiating cells' mRNA, I can't remove it based on species mapping.

rna-seq • 971 views
ADD COMMENTlink modified 2.4 years ago by h.mon28k • written 2.4 years ago by Gareth Palidwor1.6k

Edited based on OP's edits.

Hi Gareth,

my immediate thought was that if you can sequence the MEFs and get an expression profile of them alone, you could possibly remove or scale genes based on their expression there. A quick search finds this GEO dataset where there is a single MEF RNA-seq sample available. I'm not sure of the methods; hopefully another poster can chime in. Good luck.


ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by bruce.moran650

Bruce: Thanks for your response, clarification added above in the original post. The differentiating cell line is also mouse so I can't remove the contamination by mapping.

ADD REPLYlink written 2.4 years ago by Gareth Palidwor1.6k

Hi Bruce:

Removing a scaled percentage of MEF reads based on the contamination level estimate is straightforward, and I've done that. However, it messes up the fold change analysis. Subtraction increases the expression variance by a different amount for each gene, and that information is lost when I generate a new table of subtracted counts. That increased variance should somehow be incorporated into the fold change analysis; that's what I'm trying to figure out how to do.
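To make the variance point concrete, here is a minimal sketch in Python (not DESeq2 code). It assumes pure Poisson counting noise, independent counts for the mixed sample and the MEF reference on the same library-size scale, and a known contamination fraction. Real RNA-seq counts are overdispersed, so the numbers are only illustrative. The function name `subtracted_cv` and the example values are made up for this illustration.

```python
import math


def subtracted_cv(observed_mean, mef_mean, mef_fraction):
    """Coefficient of variation of (observed - fraction * MEF reference).

    Assumption: both counts are independent Poisson variables, so each
    variance equals its mean. Rescaling the result by 1 / (1 - fraction)
    afterwards would not change the CV.
    """
    signal = observed_mean - mef_fraction * mef_mean
    if signal <= 0:
        raise ValueError("subtraction removes all signal for this gene")
    # Var(O - f*M) = Var(O) + f^2 * Var(M) for independent O and M
    variance = observed_mean + mef_fraction ** 2 * mef_mean
    return math.sqrt(variance) / signal


# Two hypothetical genes with the same observed count in a 30% MEF sample
raw_cv = math.sqrt(1000.0) / 1000.0               # CV before any subtraction
cv_low_mef = subtracted_cv(1000.0, 100.0, 0.3)    # gene barely expressed in MEFs
cv_high_mef = subtracted_cv(1000.0, 3000.0, 0.3)  # gene highly expressed in MEFs

print(f"raw: {raw_cv:.3f}  low-MEF gene: {cv_low_mef:.3f}  high-MEF gene: {cv_high_mef:.3f}")
```

Under these assumptions the two genes start with the same observed count but end up with very different relative uncertainty after subtraction, which is exactly what a plain table of subtracted counts does not carry along.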

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Gareth Palidwor1.6k

I see where you are coming from, and I follow the logic. What if you estimate the contamination proportion using genes expressed only in MEFs, then subtract per library/sample, avoiding the bias? I have no idea how to incorporate that change in variance. BTW, it may be helpful to give a blow-by-blow account of what you did prior to this issue, especially if others have a similar problem in the future and you can resolve it.
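A rough sketch of that estimation step, in Python rather than R, might look like the following. It assumes the expression values are already normalized for library size, that the chosen marker genes really are silent in the differentiating cells, and that the mRNA fraction is what you want (it need not equal the cell fraction). The function name and the gene names `m1` to `m3` are hypothetical.

```python
from statistics import median


def estimate_mef_fraction(sample_expr, mef_expr, marker_genes):
    """Estimate the MEF mRNA fraction of one sample from MEF-only genes.

    For a gene expressed only in MEFs, the mixed sample should show about
    fraction * (pure MEF expression), so each ratio is one estimate of the
    fraction. The median is used to limit the influence of outlier genes.
    """
    ratios = [
        sample_expr[gene] / mef_expr[gene]
        for gene in marker_genes
        if mef_expr.get(gene, 0) > 0 and gene in sample_expr
    ]
    if not ratios:
        raise ValueError("no usable marker genes")
    # A fraction cannot exceed 1, whatever the noise in the ratios says
    return min(1.0, median(ratios))


# Hypothetical normalized expression values for three MEF-only marker genes
pure_mef = {"m1": 100.0, "m2": 200.0, "m3": 50.0}
mixed_sample = {"m1": 30.0, "m2": 62.0, "m3": 15.0}

fraction = estimate_mef_fraction(mixed_sample, pure_mef, ["m1", "m2", "m3"])
print(f"estimated MEF fraction: {fraction:.2f}")
```

Running this once per library gives a per-sample fraction that can be compared against the known contamination levels as a sanity check.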

ADD REPLYlink written 2.4 years ago by bruce.moran650

I got an answer for this from Michael Love on the Bioconductor support forum:

So you could put it in the model as a numeric covariate (you don't do anything special just put it in the design). This however assumes the relationship with expression is log linear (so linear with log expression). You probably want linear with expression though. You can try transforming the MEF variable before putting it in the design, if you expect a certain relationship.
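In DESeq2 terms this would presumably mean a design along the lines of `~ mef_fraction + timepoint`, where both column names are hypothetical. The Python sketch below (again, not DESeq2 code) illustrates the point about transforming the covariate under one specific assumption: expression in the mixed sample is a linear mixture of target-cell and MEF expression. For a gene that is silent in MEFs, log expression is then exactly linear in log2(1 - fraction) but not in the raw fraction. Other genes would need a different transform, so this is only one special case.

```python
import math


def mixture_log2_expression(target_mean, mef_mean, mef_fraction):
    """log2 of expected expression in a linear mixture of two cell types."""
    return math.log2((1 - mef_fraction) * target_mean + mef_fraction * mef_mean)


def max_abs_residual(xs, ys):
    """Largest absolute residual from a least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return max(abs(y - (mean_y + slope * (x - mean_x))) for x, y in zip(xs, ys))


# Hypothetical gene: expressed in the differentiating cells, silent in MEFs
fractions = [0.1, 0.2, 0.4, 0.6]
log_expr = [mixture_log2_expression(500.0, 0.0, f) for f in fractions]

# Candidate covariates for the design: raw fraction vs. log2(1 - fraction)
raw_fit_error = max_abs_residual(fractions, log_expr)
transformed = [math.log2(1 - f) for f in fractions]
transformed_fit_error = max_abs_residual(transformed, log_expr)

print(f"raw covariate misfit: {raw_fit_error:.4f}")
print(f"transformed covariate misfit: {transformed_fit_error:.2e}")
```

The raw fraction leaves a systematic misfit on the log scale, while the transformed covariate fits this noise-free example essentially exactly. That is the sense in which transforming the MEF variable before putting it in the design can matter.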

ADD REPLYlink written 2.3 years ago by Gareth Palidwor1.6k