Question

Differentially expressed gene with high log2foldchange by DESeq2; but not meaningful at the individual level

0

Entering edit mode

10 months ago

Sara ▴ 30

Hi all,

I am working with the RNA-Seq data on human (24Cases-20 controls) to find differentially expressed genes. my RNA-Seq data is unstranded. Here is the commands that I used to align the fastq files:

ls *_1P.fastq.gz | parallel --bar -j8 'R2=$(echo {} | sed s/_1/_2/) && out=$(echo {} | sed s/_1P// | sed s/.fastq.gz/.sam/) && rg_sm=$(echo {} | cut -d. -f1) && hisat2 -1 {} -2 $R2 -S $out --dta -x hg38_indexed -p 4 --rg SM:$rg_sm --rg ID:** --rg PL:RNAseq --rg LB:Illumina --rg PU:MAPRSeqV3'

1) My first question here is whether my command ok. Did I miss something? Also, I used --dta but I am not sure whether it can affect the results or not?

Then, I used this command to generate the count files:

ls *.bam| parallel --bar -j8 'out=$(echo {} | sed s/.bam/.count/) && htseq-count -f bam -r pos -s no -q {} genes_hg38_ensGene.gtf.gz > $out'

After generating the count files, I started to analyze them with DESeq2. I used the principal component of cell markers as a covariate in my design matrix (PC1, PC2).

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                  directory=folder,
                                  design=~Plate+RIN+Sex+Age+condition+PC2+PC1)

dds<- DESeq(dds) 

colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")

Then, I got the list of differentially expressed genes. One of the genes that I was interested in it, has meaningful padj value.

A gene of interest: Gene X:

baseMean: 58.84019698 - log2FoldChange: -3.04286494 - lfcSE: 0.997350372 - stat: -3.050948819
pvalue: 0.002281195 - padj: 0.021997412

It has a high log2Foldchange and meaningful padj value.

But when I am looking at variance established counts of this gene; and make the plot of it, I can not see the meaningful difference between cases and controls.

Expression of GeneX

I would be happy for suggestions, explanations, and ideas to realize what is going wrong and what else I must check. Thanks in advance

DESeq2 RNA-seq DGEA • 643 views

ADD COMMENT • link 10 months ago by Sara ▴ 30

1

Entering edit mode

A couple of items: 1) no comment on the first two items - hard to tell. you may want to generate QC plots for the nascent .bam and count files to see if they look "reasonable", but i cant comment on the syntax 2) with regard to the DE analysis: one possibility is that the two differ due to covariate control (whether PCs or another kind of annotation). suppose gene X makes you tall, and the covariate you are including is nutrition status. suppose all the cases (who have gene X) had realllly poor nutrition in childhood, but all the controls (no gene X) were very well fed and looked after. Then, you might have raw counts (heights; the Y variable) that are similar, but a significant p-value after covariate control. you could check this by looking at the PC1 and PC2 loadings for each sample and seeing if they are very different between the cases and controls.

id also double check that there isnt a label scrambling issue here - is it possible rownames got mixed up between the dds object and the res object? looks like two different genes are being described..

ADD REPLY • link 10 months ago by LauferVA 4.2k