Deferentially expressed gene with high log2foldchange by DESeq2; but not meaningful at the individual level
Entering edit mode
8 days ago
Sara ▴ 10

Hi all,

I am working with the RNA-Seq data on human (24Cases-20 controls) to find differentially expressed genes. my RNA-Seq data is unstranded. Here is the comments that I used to align the fastq files:

ls *_1P.fastq.gz | parallel --bar -j8 'R2=$(echo {} | sed s/_1/_2/) && out=$(echo {} | sed s/_1P// | sed s/.fastq.gz/.sam/) && rg_sm=$(echo {} | cut -d. -f1) && hisat2 -1 {} -2 $R2 -S $out --dta -x hg38_indexed -p 4 --rg SM:$rg_sm --rg ID:** --rg PL:RNAseq --rg LB:Illumina --rg PU:MAPRSeqV3'

1) My first question here is whether my command ok. Did I miss something? Also, I used --dta but I am not sure whether it can affect the results or not?

Then, I used this command to generate the count files:

ls *.bam| parallel --bar -j8 'out=$(echo {} | sed s/.bam/.count/) && htseq-count -f bam -r pos -s no -q {} genes_hg38_ensGene.gtf.gz > $out'

After generating the count files, I started to analyze them with DESeq2. I used the principal component of cell markers as a covariate in my design matrix (PC1, PC2).

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,

dds<- DESeq(dds) 

colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")

Then, I got the list of differentially expressed genes. One of the genes that I was interested in it, has meaningful padj value.

A gene of interest: Gene X:

  • baseMean: 58.84019698 - log2FoldChange: -3.04286494 - lfcSE: 0.997350372 - stat: -3.050948819
  • pvalue: 0.002281195 - padj: 0.021997412

It has a high log2Foldchange and meaningful padj value.

But when I am looking at variance established counts of this gene; and make the plot of it, I can not see the meaningful difference between cases and controls.

Expression of GeneX

I would be happy for suggestions, explanations, and ideas to realize what is going wrong and what else I must check. Thanks in advance

DESeq2 RNA-seq DGEA • 302 views
Entering edit mode

A couple of items: 1) no comment on the first two items - hard to tell. you may want to generate QC plots for the nascent .bam and count files to see if they look "reasonable", but i cant comment on the syntax 2) with regard to the DE analysis: one possibility is that the two differ due to covariate control (whether PCs or another kind of annotation). suppose gene X makes you tall, and the covariate you are including is nutrition status. suppose all the cases (who have gene X) had realllly poor nutrition in childhood, but all the controls (no gene X) were very well fed and looked after. Then, you might have raw counts (heights; the Y variable) that are similar, but a significant p-value after covariate control. you could check this by looking at the PC1 and PC2 loadings for each sample and seeing if they are very different between the cases and controls.

id also double check that there isnt a label scrambling issue here - is it possible rownames got mixed up between the dds object and the res object? looks like two different genes are being described..


Login before adding your answer.

Traffic: 1146 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6