Question

Problem with differential analysis between two groups having low raw counts

0

Entering edit mode

4.0 years ago

Biologist ▴ 290

Dear all,

I have 40 liver cancer samples. Among them I have two groups. GroupA (26 samples) and GroupB (14 samples) and I have a total of 10,000 lncRNA transcripts. I'm trying to do differential analysis between GroupA vs GroupB using edgeR.

With edgeR analysis I see that there are around 300 upregulated lncRNA transcripts in GroupA vs GroupB. But I see that all these upregulated lncRNA transcripts are not expressed in all GroupA samples (highly expressed only in few samples), and moreover I see that these are not at all expressed in GroupB which means zero counts in GroupB.

I took a specific lncRNA transcript and made a box plot which looks like below: enter image description here

There are many other upregulated lncRNA transcripts expressed only in few samples of GroupA and no expression in GroupB for differential analysis between GroupA vs GroupB.

What I should do now? Can I carry on with further analysis with those 300 upregulated lncRNA transcripts? Or Is there any other way to get true differentially expressed genes?

RNA-Seq edger differential analysis expression • 1.7k views

ADD COMMENT • link updated 4.0 years ago by ATpoint 81k • written 4.0 years ago by Biologist ▴ 290

1

Entering edit mode

These are "true" differentially expressed genes. It is a group mean that is significant, not every single event. You can use FilterByExpr and try to tweak parameters there to try and eliminate samples with many low counts. Or you can manually require that at least x% of samples per group have at least y counts. Maybe 50% per group must have at least 10 raw counts or so.

ADD REPLY • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

thank you...I actually used Filtering step from edgeR already like keep <- rowSums(cpm(y) > 1) >= sample number. With these among 10,000 I kept around 7000 lncRNA transcripts which are used for DEA.

In there is other transcript in below box plot Is this still said as upregulated in GroupA?

enter image description here

thank you.

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

So are you performing differential analysis of just lncRNAs ? May be that’s wrong. You need to use all genes quantified.

ADD REPLY • link 4.0 years ago by GouthamAtla 12k

0

Entering edit mode

I'm interested in only lncRNAs. I don't think it is wrong.

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

Are you performing cpm on total number of reads just mapped to lncRNAs ? What if you don’t have a million reads mapped in a sample ? Your library size is total reads or reads mapped only to lncRNAs ?

ADD REPLY • link 4.0 years ago by GouthamAtla 12k

0

Entering edit mode

These 10,000 lncRNA transcripts were novel transcripts in my analysis. So, yes I used cpm only.

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

Be aware that you should perform analysis on all genes and then select the interesting ones from the results. I hope you did not run edgeR on only the novel transcripts. The power of edgeR comes from sharing information across genes so you benefit from a large number of genes and many of those being highly-expressed and not changing. I suggest you include the novel transcripts (or genes depending on what exactly they are) to the set of genes already present in the reference annotation and repeat analysis.

ADD REPLY • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

From the pipeline lncRNApipe I used, to detect novel lncRNAs....I found approx 10,000 novel lncRNA transcripts. And this is what I used with edgeR for GroupA vs GroupB comparison.

Don't know what else I should use.

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

In that case, at least you need to provide the library sizes (sequencing depth of each sample) as a vector to edgeR cpm function (lib.size) for the appropriate normalisation. Or merge your novel lncRNAs with known annotations (protein-coding genes and known lncRNAs) to create a final GTF and perform differential analysis all together.

ADD REPLY • link 4.0 years ago by GouthamAtla 12k

0

Entering edit mode

Yes, I added lib.size for cpm. I'm afraid that we if I merge these novel transcripts with mRNA and Known lncRNAs, I won't be able to detect any novel differentially expressed transcripts. Because mRNAs are high expressed and lncRNAs are low expressed.

I see that library sizes of the samples is very low. There are two samples with very high raw counts. I feel like doing downsampling to these two samples (GroupA_12 and GroupB_2). May I know how to this downsampling before DEA? Please let me know.

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

@geek_y and @ATpoint Could you please check my below comment and tell me your opinion.thanq

ADD REPLY • link 4.0 years ago by Biologist ▴ 290

0

Entering edit mode

Hi, your lib sizes are too small because you are calculating only the lncRNA reads. You can get total number of reads in a bam file and use that number as library size for each sample. Otherwise I am not sure if CPM makes sense (Thats why logCPM values in your boxplots looks very high. If you have less than a million reads and you are scaling it to a million reads, which artificially inflates your signals).

ADD REPLY • link 4.0 years ago by GouthamAtla 12k

score 1 · Answer 1 · 2020-04-24

Since you posted a related question to SE https://bioinformatics.stackexchange.com/questions/13051/how-to-downsample-some-of-the-samples-in-rna-seq-data I will try and explain the background towards what I believe is the way to go:

You should take all your genes and add the novel genes/transcripts that you found to this reference and THEN create a count matrix based on this. From this normalize data with DESeq2 and plot an MA-plot. Compare this with an MA-plot you generate from normalized counts only based on a count matrix from these novel genes or transcripts. What you have to see is that the majority of genes somewhat centers at y = 0 or at least that you have a somewhat symmetric relationship between upregulated and downregulated genes so that the normalization factors have properly captured the size relationship. This aims to check if using only the novel lncRNAs is appropriate for a differential analysis. The DESeq2 assumption that the entire testing is based on is that most genes are in fact not differentially-expressed genes (DEG). I am not a lncRNA guy but from what I know they are rather lowly-expressed, unstable and assembly of novel lncRNA is far from being a truely reliable process, one often picks up artifacts.

I therefore would suggest you include these lncRNAs to the full reference and then perform a normal DEG analysis followed by filtering for the genes you are interested in. The reason for this is that the full reference provides many genes which can be used to reliable normalize the counts and from which information can be shared when it comes to dispersion estimation. It is difficult to give advise without having the data but be sure that you also try and validate some of your findings. From this boxplot you show above it seems that many samples do not even express this lncRNA you show. is this true for all of these novel transcripts? Are these novel transcripts reliable? Try and compare it with known lncRNAs and see if the novel ones have systematically lower counts. This could argue for them being artifacts.

Take home message: Try to increase confidence that the novel lncRNAs are indeed real and do not base your DEG analysis purely on their counts since this could be too unstable in order to get proper results. If you have candidates try and validate with indpendent experiments.