Question: RNAseq data analysis help
5 weeks ago, kumars.sv wrote:

Hi all!

Here is my question; any input is much appreciated. The experiment has 3 groups with only one sample each: control, treated 1, and treated 2. Treated 1 and treated 2 have library sizes of 18 and 20 million reads, respectively, while the control has a library size of 70 million reads. These library sizes are raw read counts. In addition, the control sample is from a different run. Sequencing was done on Illumina and the organism is human. I have the following questions:

a) Can I use a control from a different run?

b) Since the control has more than 3 times as many reads as the experimental samples, how can I normalize the reads?

c) Do TMM/rlog take care of a huge discrepancy in read numbers?

d) Do I have to include batch in my model matrix and design (in edgeR)?

Any help is appreciated. Thanks.

written 5 weeks ago by kumars.sv

a) Can I use a control from a different run?

By "run", do you mean a different flow cell? In my experience, flow cell effects are typically minor. Did the sample undergo the same library prep as the other samples?

b) Since the control has more than 3 times as many reads as the experimental samples, how can I normalize the reads?

Any of the established procedures will do. I suggest you quantify reads against a reference transcriptome with salmon, then aggregate the transcript-level abundance estimates it produces to the gene level with tximport. Normalization can then be done with, e.g., the RLE method from DESeq2 or TMM from edgeR. rlog itself is a transformation meant for visualization and clustering; it is not used in DEG testing. Check the manuals of the respective tools for details.
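To make the depth question more concrete, here is a minimal Python sketch of the median-of-ratios idea behind RLE normalization. It is not the DESeq2 implementation, just the arithmetic on toy numbers; the function name `size_factors` and all counts are made up for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw gene counts,
    with genes in the same order for every sample (hypothetical layout).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Only genes with non-zero counts in every sample are usable,
    # because the log geometric mean is undefined when a count is zero.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log of the per-gene geometric mean across samples.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    # Size factor = median over genes of (sample count / geometric mean).
    return {s: math.exp(median([math.log(counts[s][g]) - log_geo[g]
                                for g in usable]))
            for s in samples}

# Toy data: control is sequenced 3.5x deeper than treated1, and the
# last gene is additionally up-regulated 10x in control.
toy = {
    "treated1": [10, 20, 30, 40, 50],
    "treated2": [12, 24, 36, 48, 60],
    "control":  [35, 70, 105, 140, 1750],
}
sf = size_factors(toy)
# Dividing raw counts by the size factor puts samples on a common scale.
normalized = {s: [c / sf[s] for c in toy[s]] for s in toy}
```

Because the median is taken over genes, the single strongly changed gene does not distort the scaling, and the depth difference between control and treated samples ends up in the size factors.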

c) Do TMM/rlog take care of a huge discrepancy in read numbers?

I think so, as long as you sequenced deeply enough to capture most of the relevant genes. With shallow sequencing you will probably have quite a few dropouts, i.e. many genes that are missed and have counts of zero, which might violate the assumptions of the normalization procedure. 18 million raw reads (given the library is OK) should probably be fine.
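A quick way to get a feeling for dropouts is to compare the fraction of genes detected in each sample. A minimal Python sketch, assuming the counts are already loaded into plain lists; `detection_rate` and the numbers are hypothetical:

```python
def detection_rate(counts, min_count=1):
    """Fraction of genes with at least `min_count` reads, per sample.

    counts: dict mapping sample name -> list of raw gene counts.
    """
    return {sample: sum(c >= min_count for c in values) / len(values)
            for sample, values in counts.items()}

# Toy counts for five genes; real data would have tens of thousands.
toy_counts = {
    "treated1": [0, 3, 0, 12, 1],
    "control":  [4, 20, 0, 80, 9],
}
rates = detection_rate(toy_counts)
```

If the shallow samples detect far fewer genes than the deep control, that hints at dropouts rather than biology.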

d) Do I have to include batch in my model matrix and design (in edgeR)?

You do not have replicates, right? Which comparisons do you want to make?

written 4 weeks ago (modified 4 weeks ago) by ATpoint

a) From a different run (experiment), not a different lane. It was also a different experiment conducted on a different date (i.e. the control was sequenced on day 1 and the samples on day 62). The libraries were prepared on different days, not on the same day.

b) Since there are no replicates per group, I would use edgeR with a BCV for eukaryotes.

c) Thanks for clarifying this point.

d) No replicates in any group. The client wants expt 1 vs control, expt 2 vs control, and expt 1 vs expt 2 (with a single sample in each group).

written 4 weeks ago by kumars.sv

Well, the analysis is going to be exploratory at best. The BCV is highly sample-specific, so any value you choose is arbitrary without replicates. Your customer designed a poor experiment. But if you are going to use a fixed BCV, simply follow the edgeR manual. Batch effects can have a strong impact, but because the control is the only sample in its batch, batch and condition are completely confounded, so again, without replicates there is nothing you can do about it. Tell your customer to 1) not overly trust the results, 2) design better experiments in the future to get meaningful results, and 3) most importantly, validate the genes they will use downstream with an independent method such as qPCR. That is at least what I would do.
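To show why the choice of BCV matters so much, here is a small Python sketch of the negative binomial mean-variance relationship that edgeR's model is built on (variance = mean + dispersion * mean^2, with dispersion = BCV^2). The BCV values below are placeholders, not recommendations; 0.4 is in the range the edgeR user guide discusses for human data, but check the guide itself before using any number.

```python
import math

def nb_variance(mean, bcv):
    """Negative binomial variance for a given mean count and assumed BCV."""
    dispersion = bcv ** 2          # BCV is the square root of the dispersion
    return mean + dispersion * mean ** 2

def nb_cv(mean, bcv):
    """Total coefficient of variation (technical + biological) of the count."""
    return math.sqrt(nb_variance(mean, bcv)) / mean

# Same gene (mean count 100) under three assumed BCV values (placeholders).
for assumed_bcv in (0.1, 0.2, 0.4):
    print(assumed_bcv, nb_variance(100, assumed_bcv), nb_cv(100, assumed_bcv))
```

The expected spread of a count changes several-fold across these assumptions, so the list of "significant" genes from an unreplicated design depends directly on a number that cannot be estimated from the data.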

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by ATpoint24k