Hello everyone!
In my PhD thesis, I am working with differential expression analysis, trying to verify genes related to thermoregulation in beef cattle. I have 66 sequenced, paired-end RNA samples, divided into 2 lanes per sample. The first steps were to check the quality of these samples by FASTQC, then do the trimming (using the Trimmomatic software with the parameters: LEADING: 30 TRAILING: 30 SLIDINGWINDOW: 5: 20 AVGQUAL: 20 MINLEN: 50) and then checking the quality again of these samples by FASTQC. The FASTQC reports did not show anything out of the ordinary, so I proceeded to align the samples using the Hisat2 software.
All of these steps were done on the Linux terminal in bash. Post alignment I went to the R software, to count reads with FeatureCounts, the command I used was as follows:
fc <- featureCounts(files = fls,
annot.ext ='/home/.../Bos_taurus.ARS-UCD1.2.102.gtf' ,
isGTFAnnotationFile = TRUE,
GTF.featureType = "exon",
GTF.attrType = "gene_id",
useMetaFeatures = TRUE,
strandSpecific = 0,
isPairedEnd = TRUE,
nthreads = 30)
As my samples were in two lanes, I added the read count by samples with the following command:
cont_fc2=(cont_fc[1:(ncol(cont_fc)-1)] + cont_fc[2:ncol(cont_fc)])[c(T,F)]
That done, I started to analyze the differential expression with Deseq. The results I got were few genes differentially expressed as in the example below:
out of 20328 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 1, 0.0049%
LFC <0 (down): 4, 0.02%
outliers [1]: 22, 0.11%
low counts [2]: 0, 0%
(mean count <0)
Most of the analyzes were made using the “Beginner's guide to using the DESeq2 package”, by Love, M .; Anders, S. and Huber, W. I did the test with EdgeR, but the results were similar. To check if I did something wrong, I used the data that was used in the mentioned guide, data from the "parathyroidSE" package and also the raw RNA sequencing data mentioned in this guide so that I could do the same steps that I did with my data. I used the same parameters that I used in my data. This was done to check if something I did before the differential expression analysis, had an influence on the low results I found. The results were as follows:
results with the ready-made data from the “parathyroidSE” package:
out of 32082 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 38, 0.12%
LFC <0 (down): 44, 0.14%
outliers [1]: 0, 0%
low counts [2]: 24200, 75%
(mean count <210)
results with the parameters I used in my alignment:
out of 25980 with nonzero total read count
adjusted p-value <0.1
LFC> 0 (up): 14, 0.054%
LFC <0 (down): 14, 0.054%
outliers [1]: 0, 0%
low counts [2]: 22,946, 88%
(mean count <116)
As you can see, there is a smaller amount of genes when I used the steps I did in my alignment. Can anyone help me with suggestions of what may have happened?
Thanks!
Have you confirmed that your line of code adding the counts from the two lanes worked as expected? You should also include your edgeR code in the question as well.
Also, for sample quality control I would recommend extracting the normalized counts from the edgeR object using the
cpm
function, generating a scree plot and PCA plot using the PCAtools library, and then checking whether samples are clustering as expected.Thanks for the suggestions! The edgeR code is:
1 - created Dgelist
2 - Normalize counts
3 - Common and tagwise dispersion
4 - Test between groups
5 - Results
I have 66 samples in total, but divided in groups, the DE analysis was performed in 10 samples, divided into 2 groups (5 and 5). I hadn't mentioned this before. The PCA did not show a separation between the two groups, would that be expected? I don't know if the Scree Plot was as expected.
As @ATpoint stated, there is no clear separation of samples in the first two PCs. Since the first two PCs explain about 60% of the variance in the data, this means that there is either no strong difference between the groups, or that the noise in the data is much higher than the real effect. It might be worth exploring more PCs to see if any dimension captures the difference, since 40% of the variance in the data is not accounted for by the first 2 PCs. You can do this with the
pairsplot
function of PCAtools.Thanks for the explanations and suggestions. I did the pairplot and I think there's not much else to do.
Can you color by group in the pairsplot also?