Question: Differential expression analysis with few genes. What could have happened?
gravatar for rnaka09
20 days ago by
rnaka0930 wrote:

Hello everyone!

In my PhD thesis, I am working with differential expression analysis, trying to verify genes related to thermoregulation in beef cattle. I have 66 sequenced, paired-end RNA samples, divided into 2 lanes per sample. The first steps were to check the quality of these samples by FASTQC, then do the trimming (using the Trimmomatic software with the parameters: LEADING: 30 TRAILING: 30 SLIDINGWINDOW: 5: 20 AVGQUAL: 20 MINLEN: 50) and then checking the quality again of these samples by FASTQC. The FASTQC reports did not show anything out of the ordinary, so I proceeded to align the samples using the Hisat2 software.

All of these steps were done on the Linux terminal in bash. Post alignment I went to the R software, to count reads with FeatureCounts, the command I used was as follows:

fc <- featureCounts(files = fls, 
                    annot.ext ='/home/.../Bos_taurus.ARS-UCD1.2.102.gtf' , 
                    isGTFAnnotationFile = TRUE,
                    GTF.featureType = "exon", 
                    GTF.attrType = "gene_id", 
                    useMetaFeatures = TRUE, 
                    strandSpecific = 0, 
                    isPairedEnd = TRUE, 
                    nthreads = 30)

As my samples were in two lanes, I added the read count by samples with the following command:

cont_fc2=(cont_fc[1:(ncol(cont_fc)-1)] + cont_fc[2:ncol(cont_fc)])[c(T,F)]

That done, I started to analyze the differential expression with Deseq. The results I got were few genes differentially expressed as in the example below:

out of 20328 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 1, 0.0049%
LFC <0 (down): 4, 0.02%
outliers [1]: 22, 0.11%
low counts [2]: 0, 0%
 (mean count <0)

Most of the analyzes were made using the “Beginner's guide to using the DESeq2 package”, by Love, M .; Anders, S. and Huber, W. I did the test with EdgeR, but the results were similar. To check if I did something wrong, I used the data that was used in the mentioned guide, data from the "parathyroidSE" package and also the raw RNA sequencing data mentioned in this guide so that I could do the same steps that I did with my data. I used the same parameters that I used in my data. This was done to check if something I did before the differential expression analysis, had an influence on the low results I found. The results were as follows:

results with the ready-made data from the “parathyroidSE” package:

out of 32082 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 38, 0.12%
LFC <0 (down): 44, 0.14%
outliers [1]: 0, 0%
low counts [2]: 24200, 75%
(mean count <210)

results with the parameters I used in my alignment:

out of 25980 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 14, 0.054%
LFC <0 (down): 14, 0.054%
outliers [1]: 0, 0%
low counts [2]: 22,946, 88%
(mean count <116)

As you can see, there is a smaller amount of genes when I used the steps I did in my alignment. Can anyone help me with suggestions of what may have happened?


sequencing rna-seq alignment • 176 views
ADD COMMENTlink modified 20 days ago by swbarnes29.4k • written 20 days ago by rnaka0930

Have you confirmed that your line of code adding the counts from the two lanes worked as expected? You should also include your edgeR code in the question as well.

Also, for sample quality control I would recommend extracting the normalized counts from the edgeR object using the cpm function, generating a scree plot and PCA plot using the PCAtools library, and then checking whether samples are clustering as expected.

ADD REPLYlink written 20 days ago by rpolicastro3.3k

Thanks for the suggestions! The edgeR code is:

1 - created Dgelist

dge_N_agua=DGEList(fc_Nel_agua, group = coldata_N_agua$Group)

2 - Normalize counts

dge_N_agua=calcNormFactors(dge_N_agua, method = 'TMM')

3 - Common and tagwise dispersion


4 - Test between groups

dge_N_agua_test=exactTest(dge_N_agua, pair = c('Menos','Mais'))

5 - Results

res_dge_N_agua=topTags(dge_N_agua_test, n=nrow(dge_N_agua_test$table))
sum(res_dge_N_agua$table$FDR <0.1)
[1] 31

I have 66 samples in total, but divided in groups, the DE analysis was performed in 10 samples, divided into 2 groups (5 and 5). I hadn't mentioned this before. The PCA did not show a separation between the two groups, would that be expected? I don't know if the Scree Plot was as expected.

enter image description here

scree PLOT

ADD REPLYlink modified 19 days ago by ATpoint44k • written 19 days ago by rnaka0930

As @ATpoint stated, there is no clear separation of samples in the first two PCs. Since the first two PCs explain about 60% of the variance in the data, this means that there is either no strong difference between the groups, or that the noise in the data is much higher than the real effect. It might be worth exploring more PCs to see if any dimension captures the difference, since 40% of the variance in the data is not accounted for by the first 2 PCs. You can do this with the pairsplot function of PCAtools.

ADD REPLYlink modified 19 days ago • written 19 days ago by rpolicastro3.3k

Thanks for the explanations and suggestions. I did the pairplot and I think there's not much else to do.


ADD REPLYlink written 18 days ago by rnaka0930

Can you color by group in the pairsplot also?

ADD REPLYlink written 18 days ago by rpolicastro3.3k
gravatar for swbarnes2
20 days ago by
United States
swbarnes29.4k wrote:

The obvious answer is that your analysis is correct that there are no DE genes between whatever you are comparing. Did the PCA look like a big mess?

ADD COMMENTlink written 20 days ago by swbarnes29.4k

No DE genes would be pretty surprising with 66 samples. You should be able to detect pretty small changes even with high variances if you've got 66 samples.

ADD REPLYlink written 20 days ago by i.sudbery10k

Has the image above shows, per group, look like a little mess...!

ADD REPLYlink written 19 days ago by rnaka0930

There is no obvious group separation betwee the samples you show and sample size is limited with n=5 per group, therefore results are not unexpected (=few DEGs).

ADD REPLYlink written 19 days ago by ATpoint44k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1729 users visited in the last hour