Question

Subtracting contamination from RNASeq data

6.9 years ago

Gareth Palidwor ★ 1.6k

I have a set of RNASeq data for various points of a differentiation time series, with replicates. A known fraction of the cells in each sample are MEFs, and that fraction varies quite a bit per sample. I have expression data for a pure MEF sample grown under similar conditions.

I'd like to do a fold change analysis between time points, subtracting the MEF expression contamination in such a way that the resulting increased variance per gene is factored into the fold change analysis.

It seems it may be possible to do this within DESeq2 (for example) or with svaseq, but I can't figure out how. Can anyone recommend a strategy?

EDITED TO ADD: To clarify, this is a mouse cell line differentiation series that is contaminated by Mouse Embryonic Fibroblasts (MEFs). Because the contaminating MEF mRNA comes from the same species as the differentiating cell mRNA, I can't remove it based on species mapping.

RNA-Seq • 2.5k views

ADD COMMENT • link updated 6.9 years ago by h.mon 35k • written 6.9 years ago by Gareth Palidwor ★ 1.6k

Edited based on OP's edits.

Hi Gareth,

my immediate thought was that if you can sequence the MEFs alone and get their expression profile, you could possibly remove or scale genes based on their expression there. A quick search finds this GEO dataset, where a single MEF RNA-seq sample is available. I'm not sure of the methods; hopefully another poster can chime in. Good luck.

Bruce.

ADD REPLY • link 6.9 years ago by bruce.moran ▴ 960

Bruce: Thanks for your response; I've added a clarification to the original post above. The differentiating cell line is also mouse, so I can't remove the contamination by mapping.

ADD REPLY • link 6.9 years ago by Gareth Palidwor ★ 1.6k

Hi Bruce:

Removing a scaled percentage of MEF reads based on the contamination level estimate is straightforward, and I've done that; however, it messes up the fold change analysis. Subtraction increases the expression variance by a different amount for each gene, and that information is lost when I generate a new table of subtracted counts. That increased variance should somehow be incorporated into the fold change analysis; that's what I'm trying to figure out how to do.
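To make the variance problem concrete, here is a minimal sketch in Python. It assumes a simple mixture model, observed = (1 - f) * target + f * MEF, with both libraries scaled to the same depth and roughly Poisson counting noise; the function name `subtract_contamination` and the toy numbers are hypothetical, not what was actually run on this data.

```python
import numpy as np

def subtract_contamination(obs, mef, frac):
    """Return a decontaminated estimate and its approximate variance per gene.

    obs  : observed counts for one sample, shape (genes,)
    mef  : pure-MEF counts scaled to the same library size, shape (genes,)
    frac : estimated MEF fraction f of the sample, 0 <= f < 1
    """
    obs = np.asarray(obs, dtype=float)
    mef = np.asarray(mef, dtype=float)
    # Invert the assumed mixture: target = (obs - f * mef) / (1 - f)
    est = (obs - frac * mef) / (1.0 - frac)
    # Assumption: Poisson noise, so Var(obs) ~ obs and Var(mef) ~ mef.
    # Var(obs - f * mef) = Var(obs) + f^2 * Var(mef), then rescale by 1 / (1 - f)^2.
    var = (obs + frac**2 * mef) / (1.0 - frac) ** 2
    return est, var

# Toy example: two genes end up with the same estimate but different variance.
est, var = subtract_contamination(obs=[100, 50], mef=[200, 0], frac=0.25)
```

Both toy genes get the same subtracted value, but the one with heavy MEF expression carries more than twice the variance. A plain table of subtracted counts keeps `est` and throws `var` away, which is exactly the information that is lost.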

ADD REPLY • link 6.9 years ago by Gareth Palidwor ★ 1.6k

I see where you are coming from, and I follow the logic. What if you estimate the contamination proportion using genes expressed only in MEFs, then remove the contamination per library/sample, avoiding the bias? I have no idea how to incorporate that change in variance. BTW, it may be helpful to give a blow-by-blow account of what you did prior to this issue, especially if others have a similar problem in the future and you can resolve it.
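A rough sketch of that estimate, in Python: for a gene expressed only in MEFs, the sample's normalized expression should be about f times the pure-MEF expression, so the median ratio across marker genes gives f. The function name `estimate_mef_fraction`, the marker indices and the CPM values are made up for illustration, and the result is a fraction of mRNA, which need not equal the fraction of cells.

```python
import numpy as np

def estimate_mef_fraction(sample_cpm, mef_cpm, marker_idx):
    """Estimate the MEF mRNA fraction of one sample from MEF-specific genes.

    sample_cpm : normalized expression (e.g. CPM) of the mixed sample, shape (genes,)
    mef_cpm    : normalized expression of the pure MEF sample, shape (genes,)
    marker_idx : indices of genes assumed to be expressed only in MEFs
    """
    sample_cpm = np.asarray(sample_cpm, dtype=float)
    mef_cpm = np.asarray(mef_cpm, dtype=float)
    # For a MEF-only gene: sample ~ f * mef, so each ratio estimates f.
    ratios = sample_cpm[marker_idx] / mef_cpm[marker_idx]
    # The median is robust to a few badly behaved markers; clip to a valid fraction.
    return float(np.clip(np.median(ratios), 0.0, 1.0))

# Toy example: genes 0 and 1 are the hypothetical MEF-only markers.
frac = estimate_mef_fraction([30, 10, 500], [100, 40, 0], marker_idx=[0, 1])
```

Markers with zero expression in the pure MEF sample would need to be filtered out first, since they give a division by zero.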

ADD REPLY • link 6.9 years ago by bruce.moran ▴ 960

I got an answer for this from Michael Love on the Bioconductor support forum:

So you could put it in the model as a numeric covariate (you don't do anything special, just put it in the design). This, however, assumes the relationship with expression is log-linear (so linear with log expression). You probably want linear with expression, though. You can try transforming the MEF variable before putting it in the design, if you expect a certain relationship.

https://support.bioconductor.org/p/99042/#99161
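To illustrate what a numeric covariate in the design means, here is a small Python sketch. This is not DESeq2 and not its API: it is an ordinary least-squares fit on log expression with NumPy, and the function name `fit_log_linear`, the sample layout and the simulated values are all assumptions for illustration.

```python
import numpy as np

def fit_log_linear(log_expr, timepoint, mef_covariate):
    """Least-squares fit of log expression ~ 1 + timepoint + mef_covariate.

    log_expr      : log-scale expression, shape (samples, genes)
    timepoint     : 0/1 indicator per sample, shape (samples,)
    mef_covariate : numeric MEF variable per sample (raw or transformed)
    Returns coefficients of shape (3, genes): intercept, timepoint, MEF.
    """
    timepoint = np.asarray(timepoint, dtype=float)
    mef_covariate = np.asarray(mef_covariate, dtype=float)
    # Design matrix: the MEF variable is just one more numeric column.
    design = np.column_stack([np.ones_like(timepoint), timepoint, mef_covariate])
    coef, *_ = np.linalg.lstsq(design, np.asarray(log_expr, dtype=float), rcond=None)
    return coef

# Toy data: one gene whose log expression is exactly 5 + 2 * time + 3 * mef.
timepoint = np.array([0, 0, 1, 1])
mef_fraction = np.array([0.1, 0.3, 0.2, 0.4])
log_expr = (5 + 2 * timepoint + 3 * mef_fraction).reshape(-1, 1)
coef = fit_log_linear(log_expr, timepoint, mef_fraction)
```

The timepoint coefficient is the log fold change adjusted for the MEF variable. To try a transformed covariate as suggested in the quote, pass the transformed values in place of `mef_fraction`; which transform is appropriate depends on the relationship you expect and is not settled here.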

ADD REPLY • link 6.8 years ago by Gareth Palidwor ★ 1.6k