Differential expression analysis with few genes. What could have happened?
1
1
Entering edit mode
12 months ago
rnaka09 ▴ 30

Hello everyone!

In my PhD thesis, I am working with differential expression analysis, trying to verify genes related to thermoregulation in beef cattle. I have 66 sequenced, paired-end RNA samples, divided into 2 lanes per sample. The first steps were to check the quality of these samples by FASTQC, then do the trimming (using the Trimmomatic software with the parameters: LEADING: 30 TRAILING: 30 SLIDINGWINDOW: 5: 20 AVGQUAL: 20 MINLEN: 50) and then checking the quality again of these samples by FASTQC. The FASTQC reports did not show anything out of the ordinary, so I proceeded to align the samples using the Hisat2 software.

All of these steps were done on the Linux terminal in bash. Post alignment I went to the R software, to count reads with FeatureCounts, the command I used was as follows:

fc <- featureCounts(files = fls,
annot.ext ='/home/.../Bos_taurus.ARS-UCD1.2.102.gtf' ,
isGTFAnnotationFile = TRUE,
GTF.featureType = "exon",
GTF.attrType = "gene_id",
useMetaFeatures = TRUE,
strandSpecific = 0,
isPairedEnd = TRUE,


As my samples were in two lanes, I added the read count by samples with the following command:

cont_fc2=(cont_fc[1:(ncol(cont_fc)-1)] + cont_fc[2:ncol(cont_fc)])[c(T,F)]


That done, I started to analyze the differential expression with Deseq. The results I got were few genes differentially expressed as in the example below:

out of 20328 with nonzero total read count
LFC> 0 (up): 1, 0.0049%
LFC <0 (down): 4, 0.02%
outliers [1]: 22, 0.11%
low counts [2]: 0, 0%
(mean count <0)


Most of the analyzes were made using the “Beginner's guide to using the DESeq2 package”, by Love, M .; Anders, S. and Huber, W. I did the test with EdgeR, but the results were similar. To check if I did something wrong, I used the data that was used in the mentioned guide, data from the "parathyroidSE" package and also the raw RNA sequencing data mentioned in this guide so that I could do the same steps that I did with my data. I used the same parameters that I used in my data. This was done to check if something I did before the differential expression analysis, had an influence on the low results I found. The results were as follows:

out of 32082 with nonzero total read count
LFC> 0 (up): 38, 0.12%
LFC <0 (down): 44, 0.14%
outliers [1]: 0, 0%
low counts [2]: 24200, 75%
(mean count <210)


results with the parameters I used in my alignment:

out of 25980 with nonzero total read count
LFC> 0 (up): 14, 0.054%
LFC <0 (down): 14, 0.054%
outliers [1]: 0, 0%
low counts [2]: 22,946, 88%
(mean count <116)


As you can see, there is a smaller amount of genes when I used the steps I did in my alignment. Can anyone help me with suggestions of what may have happened?

Thanks!

RNA-Seq alignment sequencing • 641 views
3
Entering edit mode

Have you confirmed that your line of code adding the counts from the two lanes worked as expected? You should also include your edgeR code in the question as well.

Also, for sample quality control I would recommend extracting the normalized counts from the edgeR object using the cpm function, generating a scree plot and PCA plot using the PCAtools library, and then checking whether samples are clustering as expected.

1
Entering edit mode

Thanks for the suggestions! The edgeR code is:

1 - created Dgelist

dge_N_agua=DGEList(fc_Nel_agua, group = coldata_N_agua$Group)  2 - Normalize counts dge_N_agua=calcNormFactors(dge_N_agua, method = 'TMM')  3 - Common and tagwise dispersion dge_N_agua=estimateCommonDisp(dge_N_agua) dge_N_agua=estimateTagwiseDisp(dge_N_agua)  4 - Test between groups dge_N_agua_test=exactTest(dge_N_agua, pair = c('Menos','Mais'))  5 - Results res_dge_N_agua=topTags(dge_N_agua_test, n=nrow(dge_N_agua_test$table))
sum(res_dge_N_agua$table$FDR <0.1)
[1] 31


I have 66 samples in total, but divided in groups, the DE analysis was performed in 10 samples, divided into 2 groups (5 and 5). I hadn't mentioned this before. The PCA did not show a separation between the two groups, would that be expected? I don't know if the Scree Plot was as expected.

2
Entering edit mode

As @ATpoint stated, there is no clear separation of samples in the first two PCs. Since the first two PCs explain about 60% of the variance in the data, this means that there is either no strong difference between the groups, or that the noise in the data is much higher than the real effect. It might be worth exploring more PCs to see if any dimension captures the difference, since 40% of the variance in the data is not accounted for by the first 2 PCs. You can do this with the pairsplot function of PCAtools.

0
Entering edit mode

Thanks for the explanations and suggestions. I did the pairplot and I think there's not much else to do.

0
Entering edit mode

Can you color by group in the pairsplot also?

1
Entering edit mode
12 months ago

The obvious answer is that your analysis is correct that there are no DE genes between whatever you are comparing. Did the PCA look like a big mess?

2
Entering edit mode

No DE genes would be pretty surprising with 66 samples. You should be able to detect pretty small changes even with high variances if you've got 66 samples.

1
Entering edit mode

Has the image above shows, per group, look like a little mess...!

2
Entering edit mode

There is no obvious group separation betwee the samples you show and sample size is limited with n=5 per group, therefore results are not unexpected (=few DEGs).