Hello! I'm a first time poster so I hope I`ll be able to formulate the question succinctly and in accordance to forum guidelines.
We have Ampliseq RNA- seq data from 33 human donors. The donors vary in age from 1 year old to 81 years old. Our goal is to find any Changes in transcript that correlates to age. 5 of the samples have been handled by Another member of our Group and cluster on an MDS- plot(using plotMDS) so I also want to correct for the batch- effect.
Below is a snippet of my code:
#y is a DGElist- object, batch is a factor with states which of the 2 batches our libraries belong to and Anno$Age is a column from a dataframe that specifices the age of each donor.
design<-model.matrix(~batch+as.numeric(Anno$age))
y<-estimateDispersion(y, design)
fit<-glmQLFit(y, design)
qlf<-glmQLFtest(fit)
5 of our samples have a very high expression of transcripts that is commonly found in a cell type that is found Close to but not in the tissue we are examining. We suspect that it represents contamination but as the samples with the highest expression of these transcripts are older we get a lot of hits("significantly changed" genes) that appear to correlate with age but we suspect may actually be due to the contamination.
Is there any way to compensate for this effect? I have thought about placing the libraries with the most suspected contamination in a separate batch so as to correct for the suspected contamination, but I am hesitant as to whether or not that is reasonable. These samples do not cluster independently on a MDS- plot. We have no other objective way to quantify any contamination so our only way to determine whether contamination exists is from our RNA- seq data and I am concerned we could tailor our data to our expectations.
I am a beginner to bioinformatics and this is my first post so I would appreciate any advise. If you`d like me to revise/restate the question or in any way clarify I'm happy to do so. If you have any concerns about the code snippet shown(or would like to see the entire code), please let me know.
Addition: The ages of the donors are: 1 3 5 8 12 13 14 16 16 16 17 17 18 21 27 28 31 33 35 50 50 55 57 57 58 60 63 72 73 73 77 79 81
I added the
edgeR
tag to attract the developers / maintainers of the tool for an experts opinion. Just to add details, can you post the age distribution of your donors to see if it is continuous or rather bimodal? That might help others to come up with good answers.