Differential expression analysis with few genes. What could have happened?
1
1
Entering edit mode
3.3 years ago
rnaka09 ▴ 30

Hello everyone!

In my PhD thesis, I am working with differential expression analysis, trying to verify genes related to thermoregulation in beef cattle. I have 66 sequenced, paired-end RNA samples, divided into 2 lanes per sample. The first steps were to check the quality of these samples by FASTQC, then do the trimming (using the Trimmomatic software with the parameters: LEADING: 30 TRAILING: 30 SLIDINGWINDOW: 5: 20 AVGQUAL: 20 MINLEN: 50) and then checking the quality again of these samples by FASTQC. The FASTQC reports did not show anything out of the ordinary, so I proceeded to align the samples using the Hisat2 software.

All of these steps were done on the Linux terminal in bash. Post alignment I went to the R software, to count reads with FeatureCounts, the command I used was as follows:

fc <- featureCounts(files = fls, 
                    annot.ext ='/home/.../Bos_taurus.ARS-UCD1.2.102.gtf' , 
                    isGTFAnnotationFile = TRUE,
                    GTF.featureType = "exon", 
                    GTF.attrType = "gene_id", 
                    useMetaFeatures = TRUE, 
                    strandSpecific = 0, 
                    isPairedEnd = TRUE, 
                    nthreads = 30)

As my samples were in two lanes, I added the read count by samples with the following command:

cont_fc2=(cont_fc[1:(ncol(cont_fc)-1)] + cont_fc[2:ncol(cont_fc)])[c(T,F)]

That done, I started to analyze the differential expression with Deseq. The results I got were few genes differentially expressed as in the example below:

out of 20328 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 1, 0.0049%
LFC <0 (down): 4, 0.02%
outliers [1]: 22, 0.11%
low counts [2]: 0, 0%
 (mean count <0)

Most of the analyzes were made using the “Beginner's guide to using the DESeq2 package”, by Love, M .; Anders, S. and Huber, W. I did the test with EdgeR, but the results were similar. To check if I did something wrong, I used the data that was used in the mentioned guide, data from the "parathyroidSE" package and also the raw RNA sequencing data mentioned in this guide so that I could do the same steps that I did with my data. I used the same parameters that I used in my data. This was done to check if something I did before the differential expression analysis, had an influence on the low results I found. The results were as follows:

results with the ready-made data from the “parathyroidSE” package:

out of 32082 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 38, 0.12%
LFC <0 (down): 44, 0.14%
outliers [1]: 0, 0%
low counts [2]: 24200, 75%
(mean count <210)

results with the parameters I used in my alignment:

out of 25980 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 14, 0.054%
LFC <0 (down): 14, 0.054%
outliers [1]: 0, 0%
low counts [2]: 22,946, 88%
(mean count <116)

As you can see, there is a smaller amount of genes when I used the steps I did in my alignment. Can anyone help me with suggestions of what may have happened?

Thanks!

RNA-Seq alignment sequencing • 1.2k views
ADD COMMENT
3
Entering edit mode

Have you confirmed that your line of code adding the counts from the two lanes worked as expected? You should also include your edgeR code in the question as well.

Also, for sample quality control I would recommend extracting the normalized counts from the edgeR object using the cpm function, generating a scree plot and PCA plot using the PCAtools library, and then checking whether samples are clustering as expected.

ADD REPLY
1
Entering edit mode

Thanks for the suggestions! The edgeR code is:

1 - created Dgelist

dge_N_agua=DGEList(fc_Nel_agua, group = coldata_N_agua$Group)

2 - Normalize counts

dge_N_agua=calcNormFactors(dge_N_agua, method = 'TMM')

3 - Common and tagwise dispersion

dge_N_agua=estimateCommonDisp(dge_N_agua)
dge_N_agua=estimateTagwiseDisp(dge_N_agua)

4 - Test between groups

dge_N_agua_test=exactTest(dge_N_agua, pair = c('Menos','Mais'))

5 - Results

res_dge_N_agua=topTags(dge_N_agua_test, n=nrow(dge_N_agua_test$table))
sum(res_dge_N_agua$table$FDR <0.1)
[1] 31

I have 66 samples in total, but divided in groups, the DE analysis was performed in 10 samples, divided into 2 groups (5 and 5). I hadn't mentioned this before. The PCA did not show a separation between the two groups, would that be expected? I don't know if the Scree Plot was as expected.

enter image description here

scree PLOT

ADD REPLY
2
Entering edit mode

As @ATpoint stated, there is no clear separation of samples in the first two PCs. Since the first two PCs explain about 60% of the variance in the data, this means that there is either no strong difference between the groups, or that the noise in the data is much higher than the real effect. It might be worth exploring more PCs to see if any dimension captures the difference, since 40% of the variance in the data is not accounted for by the first 2 PCs. You can do this with the pairsplot function of PCAtools.

ADD REPLY
0
Entering edit mode

Thanks for the explanations and suggestions. I did the pairplot and I think there's not much else to do.

pairplot

ADD REPLY
0
Entering edit mode

Can you color by group in the pairsplot also?

ADD REPLY
1
Entering edit mode
3.3 years ago

The obvious answer is that your analysis is correct that there are no DE genes between whatever you are comparing. Did the PCA look like a big mess?

ADD COMMENT
2
Entering edit mode

No DE genes would be pretty surprising with 66 samples. You should be able to detect pretty small changes even with high variances if you've got 66 samples.

ADD REPLY
1
Entering edit mode

Has the image above shows, per group, look like a little mess...!

ADD REPLY
2
Entering edit mode

There is no obvious group separation betwee the samples you show and sample size is limited with n=5 per group, therefore results are not unexpected (=few DEGs).

ADD REPLY

Login before adding your answer.

Traffic: 2275 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6