Question: Low DE genes in DESeq2 (update PCA plots)
gravatar for CandiceChuDVM
3.0 years ago by
United States/College Station/Texas A&M University
CandiceChuDVM1.8k wrote:

Hi all,

I want to use RNA-seq data to discover DE genes in dogs with a specific hereditary disease versus healthy dogs. I used Illumina Hiseq (paired-end) to sequence 8 samples (rapid=3, slow=3, control=2). Each of them has ~30M reads. However, I only got 15 genes with adjusted p-values were less than 0.1 in DEseq. Could somebody point our what's wrong? Here are my commands and outputs in each step:

Genome and annotation


HISAT2 Commands

hisat2-build -p 20 $GENOME canFam3

python $ANNOT > splicesites.txt

hisat2 -p 20 \
   -q \
   --dta \
   --known-splicesite-infile $WORKDIR/splicesites.txt \
   -x $HST2IDX \
   -1 R1_combined.fastq.gz \
   -2 R2_combined.fastq.gz \
   -S align.sam

HISAT2 Outputs

samtools flagstat align.sam

65458550 + 0 in total (QC-passed reads + QC-failed reads)
6429010 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
62834643 + 0 mapped (95.99%:-nan%)
59029540 + 0 paired in sequencing
29514770 + 0 read1
29514770 + 0 read2
54628232 + 0 properly paired (92.54%:-nan%)
55418058 + 0 with itself and mate mapped
987575 + 0 singletons (1.67%:-nan%)
225944 + 0 with mate mapped to a different chr
200390 + 0 with mate mapped to a different chr (mapQ>=5)

hisat2 log:

29514770 reads; of these:
  29514770 (100.00%) were paired; of these:
2200654 (7.46%) aligned concordantly 0 times
22741847 (77.05%) aligned concordantly exactly 1 time
4572269 (15.49%) aligned concordantly >1 times
2200654 pairs aligned concordantly 0 times; of these:
  317519 (14.43%) aligned discordantly 1 time
1883135 pairs aligned 0 times concordantly or discordantly; of these:
  3766270 mates make up the pairs; of these:
    2623907 (69.67%) aligned 0 times
    808718 (21.47%) aligned exactly 1 time
    333645 (8.86%) aligned >1 times

95.55% overall alignment rate

Given that I got ~95% alignment rate, I suspect the problem lies in the following steps.

IGV image

Here is an IGV image from a sample. Is it well-aligned?

enter image description here

Samtools Commands

samtools view -S -u align.sam -o  align.bam

samtools sort -n -@ 20 -m 2G align.bam align.namesorted

HTseq Commands

htseq-count -f bam -i gene_id -r name -s no -t exon align.namesorted.bam $ANNOT > count.txt

HTseq Outputs

In count.txt:

(24580 lines of gene_id and counts are omitted here)
__no_feature    11982150
__ambiguous     250644
__too_low_aQual 1603475
__not_aligned   818166
__alignment_not_unique  4799372

DESeq2 Commands

sampleFiles <- grep("count",list.files(directory),value=TRUE) 
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(rep("t1",8))
sampleTable <- data.frame(sampleName = sampleFiles, 
                      fileName = sampleFiles, 
                      condition=condition, timepoints=timepoints)
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, 
                                   directory = directory, 
                                   design= ~condition)
dds <- dds[ rowSums(counts(dds)) > 3, ]
dds$condition <- relevel(dds$condition, ref="control")
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition","rapid","control"))
sum(res$padj < 0.1, na.rm=TRUE)

sum(res$padj < 0.1, na.rm=TRUE) is the step that gave me only got 15 DE genes.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by CandiceChuDVM1.8k

htseq-count has an option to specify that your reads are sorted by name, -r name Not sure what's the default...

You lose a lot of counts in the __no_feature section o.o

ADD REPLYlink written 3.0 years ago by WouterDeCoster38k

One reason to use featureCounts. It sorts your files automatically before counting and it is fast.

ADD REPLYlink written 3.0 years ago by genomax65k

I redid the htseq-count with -r name command. It does not change the result :(

ADD REPLYlink written 3.0 years ago by CandiceChuDVM1.8k
  1. You didn't show how dds was made.
  2. timepoints is pointless
  3. What makes you think the results are wrong?
  4. Have you looked at the alignments in IGV? That's the only way to get a grip on why only a smallish percent of the alignments are high enough quality and on exons.
ADD REPLYlink written 3.0 years ago by Devon Ryan89k

Hi Devon,

  1. Sorry that I missed the dda step. I have updated it my post.
  2. timepoints is a prebuild column for future analysis. I only have t1 data now but I am going to analyze t1~t3 in the future. I know it's pointless now since I only have t1 data but I include it to remind myself for future analysis.
  3. The rapid group and slow group are dogs with a hereditary disease so I expect to see a lot of DE genes compare to the control group. Also, I think the depth of my RNA-seq data is deep enough (~30M reads) to give me more than 15 DE genes. I am worried that the number of DE gene would be too small to allow me to do IPA.
  4. I haven't checked IGV yet. Let me check then I will post my results.

I appreciate your suggestions!

ADD REPLYlink written 3.0 years ago by CandiceChuDVM1.8k

I have uploaded the IGV image. It looks well-aligned to me.

ADD REPLYlink written 3.0 years ago by CandiceChuDVM1.8k

Presumably you have a high degree of variability.

ADD REPLYlink written 3.0 years ago by Devon Ryan89k
gravatar for Kristin Muench
3.0 years ago by
United States
Kristin Muench410 wrote:

How noisy is your data? It is possible that differentially expressed genes are hard to detect because of expression variance. Have you tried plotting the first two principal components of your data, or making a dendrogram of principal components? Ideally, it would separate into healthy and hereditary-variant-of-interest groups. If not, is it possible there is a batch effect in your data that you need to account for? e.g., with R package sva, and its ComBat function

ADD COMMENTlink written 3.0 years ago by Kristin Muench410

Thanks, Kristin! I have the dendrogram when I use Tuxedo suite to analyze my data: enter image description here However, I haven't tried to do PCA with this new analysis.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by CandiceChuDVM1.8k

With DESeq2 you could simply use this:

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory,  design= ~condition)
rld <- rlog(dds) # rlog transformation of your data as described in DESeq2 vignette
plotPCA(rld, intgroup="condition")

and obtain a better overview of the variabily across sample.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by VHahaut1.1k

Thanks! I did the PCA plot and discovered my inter-condition variance is very big. They are much more similar timepoint-wise rather than condition-wise :( I wonder why the results from Cuffdiff and DEseq2 are so different. PCA plot based on condition PCA plot based on timepoints

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by CandiceChuDVM1.8k

I've also seen really big differences between DE genes called by cuffdiff and DE genes called by DESeq2 in my data. I'm not sure why, but in general people put much more faith into the DESeq2 results. I'm guessing the differences could arise because of differences in how cuffdiff and DESeq do the test - for example, cuffdiff normalizes genes by recasting expression as FPKMs, and FPKM values can potentially be skewed by changes in expression of large genes. DESeq has a more subtle way of adjusting expression levels for library size.

P.S. A note on dendrograms - I've heard it's better/more informative to do dendrograms on principal components rather than raw data because PCs are orthogonal/independent, whereas raw data might not be.

ADD REPLYlink written 3.0 years ago by Kristin Muench410
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1591 users visited in the last hour