Difference in total gene counts between samples
23 months ago
bart ▴ 50

Hi,

I'm analysing RNA-seq data and I'm seeing large differences in the total gene counts between samples. Below is a barplot of the total gene counts for all of my samples; the y-axis shows total gene counts divided by 10^6. The two samples with the lowest counts have about 0.2x10^6 counts, while the sample with the highest has >20x10^6. Does anyone know whether this difference is normal, and whether it is advisable to remove certain outlier samples?

[barplot of total gene counts per sample]

Thanks in advance!

Tags: rna-seq

Rather than relying on simple gene counts, you will want to check PCA plots to identify outliers. There is bound to be some variability in total counts, since sequencing depth can be uneven. Analysis programs like DESeq2/edgeR take this into account.

23 months ago
ATpoint 81k

Some variation in total read counts is normal, since samples can never be perfectly balanced, but what you show here is technical: some samples are massively undersequenced. Up to a certain point this can be tolerated in the DE analysis, but once depth is so low that expressed genes end up with zero counts purely because of undersequencing, it becomes problematic. I would do a PCA (see the DESeq2 manual) and check whether read depth is a confounding factor. If so, either remove the most undersequenced samples or, preferably, resequence them to increase depth.
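A quick way to see how badly the undersequencing bites, assuming a DESeqDataSet called dds built from your count matrix (the object name is just illustrative), is to plot the number of detected genes against per-sample depth; massively undersequenced samples will sit in the lower-left corner:

library(DESeq2)

# per-sample library size and number of genes with at least one count
libsize  <- colSums(counts(dds))
detected <- colSums(counts(dds) > 0)

plot(libsize / 1e6, detected, pch = 19,
     xlab = "Total counts (millions)", ylab = "Genes detected (count > 0)")
text(libsize / 1e6, detected, labels = colnames(dds), pos = 3, cex = 0.6)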


Hi, I've added a PCA plot. Sample clustering isn't great, unfortunately. This plot uses all the genes; I'm not sure whether that is OK or whether I should use only significant DE genes. In any case, it seems that sequencing depth isn't the only confounding factor, since most samples do not cluster into the correct groups. I could add sex as a factor to the design argument of DESeqDataSetFromMatrix. Is there anything else I could do to improve the clustering?

[PCA plot]


Usually one uses the most variable genes, e.g. by running rowVars() on the output of the DESeq2 vst transformation, sorting it in decreasing order and taking, say, the top 500. In any case, check whether the samples on the far left are the ones with tiny depth. As another user already said, 4000 DEGs with such poor clustering is at least unexpected. I would really see whether you can resequence some samples to proper depth, and check whether excluding the low-depth samples changes the outcome. Visually, the red samples tend to sit further left, so if "far left" means low depth, then the red group may be nested with low depth and many of the DEGs could be due to zero inflation: many genes in the red group are technical (low-depth) zeros, while in the other group they are not. Just thinking aloud.
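A minimal sketch of that gene selection, assuming a DESeqDataSet named dds (the 500 cutoff is just the plotPCA default, not a magic number):

library(DESeq2)
library(matrixStats)

vsd <- vst(dds, blind = TRUE)                   # variance-stabilised counts
rv  <- rowVars(assay(vsd))                      # per-gene variance across samples
top <- head(order(rv, decreasing = TRUE), 500)  # indices of the 500 most variable genes
pca <- prcomp(t(assay(vsd)[top, ]))             # same genes plotPCA() would use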


I've added sex and library size as variables to the design formula, in another dataset. However, I accidentally switched the datasets I'm using for the PCA plots: one disease type (GBM) has been swapped with another (BrainMetastases). The same problem is present in all datasets though, so it does not matter that much: I'm having problems with the PCA plot clustering. The code I'm using is as follows:

# create DESeq2 object: design includes group (healthy = AC or disease = GBM),
# sex (Male/M or Female/F) and total gene counts (high or low)
dds <- DESeqDataSetFromMatrix(dataset, colData = metadata,
                              design = ~ Patient.group + sex + total_gene_counts)
dds <- estimateSizeFactors(dds)

# filter genes with low counts: drop genes with fewer than 30 counts in >90% of samples
badgenes <- names(which(apply(counts(dds), 1, function(x) sum(x < 30)) > 0.9 * ncol(dds)))
ddsFiltered <- dds[!rownames(dds) %in% badgenes, ]
ddsFiltered <- DESeq(ddsFiltered)

# perform PCA on variance-stabilised counts
vsd <- vst(ddsFiltered, blind = TRUE)
plotPCA(vsd, intgroup = c("Patient.group"), ntop = 500)

[PCA plot: Patient.group]

plotPCA(vsd, intgroup = c("Patient.group", "sex"), ntop = 500)
[PCA plot: Patient.group + sex]

plotPCA(vsd, intgroup = c("Patient.group", "total_gene_counts"), ntop = 500)
[PCA plot: Patient.group + total_gene_counts]

plotPCA(vsd, intgroup = c("Patient.group", "total_gene_counts", "sex"), ntop = 500)
[PCA plot: Patient.group + total_gene_counts + sex]

Visually, the clustering improves somewhat when sex and library size/total gene counts are added as variables, especially when colouring by total library size/total gene counts. However, the proportion of variance explained by the principal components does not improve. Is there anything else I can do to improve it?


You can't improve your data; it is what it is. You have something rather strong going on with the gene counts, and you need to figure out whether that's biological or technical. My fear is that it's technical.


That's bad news, but thanks for telling me. One question: does DESeq2 automatically correct for library size/total gene counts in the vst step, or do I have to add a specific line to do this? At the moment I have only done the former, not the latter. However, I've also seen the following being used:

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)
normalized_counts <- counts(dds, normalized=TRUE)
vsd <- vst(dds, blind=TRUE)
assay(vsd)
plotPCA(vsd, intgroup=c("variable"),ntop=500)

And this confuses me, because normalized_counts is not used downstream. So I was wondering whether my code perhaps does not account for library size.

dds <- DESeq(dds)

already includes the library-size normalization step. If you made your dds from a matrix and tried to skip straight to the vst step, it would tell you to do the size normalization first. So it's already being used.
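To convince yourself, you can inspect the size factors directly (a quick check, assuming the dds object from your code):

dds <- DESeq(dds)                      # runs estimateSizeFactors() internally
sizeFactors(dds)                       # per-sample normalization factors
head(counts(dds, normalized = TRUE))   # raw counts divided by the size factors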


So, to be clear, library size is not a factor that can explain the lack of correct clustering in my PCA plots, right? Could I then use SVA or RUVSeq to find batch effects that might explain the lack of clustering, and add these to the design formula? I'm trying to find out whether there is any way I can still use my data, or whether I should stop trying.
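For reference, a minimal sketch of how the SVA route could be wired into the design (following the general sva + DESeq2 pattern; Patient.group and the choice of two surrogate variables are assumptions, and this may well not rescue samples with such extreme depth differences):

library(sva)
library(DESeq2)

dat  <- counts(dds, normalized = TRUE)
dat  <- dat[rowMeans(dat) > 1, ]                                     # drop near-empty genes
mod  <- model.matrix(~ Patient.group, as.data.frame(colData(dds)))  # full model
mod0 <- model.matrix(~ 1, as.data.frame(colData(dds)))              # null model
svobj <- svaseq(dat, mod, mod0, n.sv = 2)                            # estimate 2 surrogate variables

dds$SV1 <- svobj$sv[, 1]
dds$SV2 <- svobj$sv[, 2]
design(dds) <- ~ SV1 + SV2 + Patient.group
dds <- DESeq(dds)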


I think it's glaringly obvious that library size is driving your first PC. I would not be at all confident that the typical techniques will work when there is a hundred-fold difference in read count, and I would not expect any software to magically make your samples drastically different from how they appear now. You need to stop with R and find out why some samples have so few reads.


If your two kinds of samples are not that different from each other, you can't fix that. You might have real DE genes, but you won't have many.


I've found ~4000 DE genes (log2FC > 1 or < -1 with padj < 0.05), so shouldn't there be better clustering?


I'd be a little worried about getting so many significant DE genes with such poor clustering, though that could be the result of the sheer number of samples here. I'd definitely add another variable (like sex) and see if that improves things; if it does, you will want to calculate differential expression for samples from each sex separately.

Also, what happens if you run your clustering again and also add a gene-count variable? You can classify each sample as low/high based on an arbitrary cutoff near the mean/median, as sketched below.
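Something along these lines, assuming the dds from your earlier code (the median cutoff and the column name total_gene_counts are just illustrative):

total_counts <- colSums(counts(dds))                              # per-sample depth
dds$total_gene_counts <- factor(ifelse(total_counts > median(total_counts),
                                       "high", "low"))            # arbitrary median cutoff
plotPCA(vst(dds, blind = TRUE), intgroup = c("Patient.group", "total_gene_counts"))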


Hi, I've responded to your questions above!
