DEseq2 results differs greatly when comparing output from STAR/featureCounts and Salmon
2
1
Entering edit mode
17 months ago
testtube ▴ 20

I have a standard RNA-seq dataset, 3 replicates of treated and control. I have obtained the read counts according to 2 different pipelines, first with STAR and featureCounts (with ENCODE preferred options) and second with Salmon.

First pipeline (STAR -> featureCount -> DEseq2)

STAR \
--limitBAMsortRAM 31000000000 \
--genomeDir STAR_index \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin  1 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFileNamePrefix results/STAR/control_1/ \
--outSAMtype BAM SortedByCoordinate \
--chimSegmentMin 20

featureCounts \
-p \
-B \
-C \
-T 1 \
-s 2 \
-a Homo_sapiens.GRCh38.94.gtf \
-o control_x.txt \
Aligned.sortedByCoord.out.bam


DEseq2

dds <- DESeqDataSetFromMatrix(countData = countData, #a dataframe of the read counts
colData = colData,
design = ~ conditions, tidy=TRUE)
dds <- DESeq(dds)
deseq.results <- results(dds, contrast=c("conditions", "treated", "control"))


Second pipeline (Salmon -> DEseq2)

salmon quant \
-i samlon_index \
-p 4 \
-l A \
-1 R1.fastq.gz \
-2 R2.fastq.gz \
--validateMappings \
--gcBias \
-o control_x


DEseq2

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~conditions)
dds <- DESeq(dds)
deseq.results <- results(dds, contrast=c("conditions", "treated", "control"))


When I try to correlated the log2FoldChange between but output, I get fairly low correlation (about 0.2). Furthermore, I get 49 significantly DE genes from STAR and 64 from Salmon, of which only 3 are shared. From the literature, I was expecting much better correlations.

Am I doing something wrong in the process? If not, what can explain such wide difference?

Thanks!

EDITS

Of notes, I am not expecting many DEGs with this dataset. This has also been the case with other datasets with a greater number of DEGs.

plotMA(dds.star)

plotMA(dds.salmon)

RNA-Seq DEseq2 salmon STAR • 1.3k views
0
Entering edit mode

Hmm, in any case you almost have no DEGs, I'd call this consistent. Can you show MA-plots of both pipelines? plotMA function.

0
Entering edit mode

Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)

0
Entering edit mode

Why do you allow a read to map at 20 positions with the parameter --outFilterMultimapNmax 20? Feature counts only counts uniquely mapped reads anyway.

The parameters -alignSJoverhangMin and --alignSJDBoverhangMin will not affect the resulting bam file if you don't state the parameter --outFilterType BySJout. Also --alignSJDBoverhangMin 1 is very low. A read only needs one nucleotide right at one side of the annotated junction to be aligned to it.

Are you sure, you want to allow chimeric alignment with the --chimSegmentMin 20 parameter for DGE? This will incluse reads mapping to 2 chromosomes.

0
Entering edit mode

Quite frankly, I used these parameters by default now. I have spent some time over different projects with various parameters and when I used these, which come from the ENCODE protocols, I systematically get higher uniquely mapped reads. I tend to simply move on when they give me uniquely mapped reads above 0.9. Filter like --outFilterMultimapNmax are kept as I used the BAM file for other pipelines in parallel of featureCount. --outFilterType BySJout is simply missing from the cut and paste, sorry for the confusion.

3
Entering edit mode
17 months ago

The long and the short of it is that if you were expecting much of a biological signal, then hen you would expect the correlation between any two methods to be low.

Each logfoldchange from each method is made up of a biological signal and random method specific error. What a low correlation is telling you is that the error is larger than the biological signal. But this could either be because error is large or the biological signal is small.In the case where you have so few DEGs from either pipeline, I'd suggest it is because the biological signal is small, rather than because the error is large.

0
Entering edit mode

Thanks, this makes sense. But for such cases, which list can I trust, or at least should prefer as the basis for further biological validation and experiments?

0
Entering edit mode

Are the lists of DE genes very different? I wouldn't trust any of the LFCs from either method if they are not DE. DESeq2 will output and SE for for the logFoldChange estimate, so you should be able to calculate confidence intervals for the logFoldChange of each gene. You'll note that even though there is not much correlation, the 95CI for a given gene will overlap between the two methods.

0
Entering edit mode

That's my main problem, the significant DE (baseMean >= 1, abs(l2FC) >= 1, padj <= 0.05) lists are very different.

0
Entering edit mode

Can you post the read counts from each method for a couple of genes that are not shared (so maybe two Salmon only genes and two star only genes)?

0
Entering edit mode

DE in salmon but not STAR

gene_name gene_biotype treated_1 treated_2 treated_3 control_1 control_2 control_3 baseMean log2FoldChange lfcSE stat pvalue padj

SALMON DE gene_A protein_coding 22.46318 19.45674 8.177063 6.213502 3.343609 3.186051 399.5247 1.974991 0.4897088 4.032992 5.51E-05 0.02862106

STAR gene_A protein_coding 3.413084 2.758493 3.810379 2.969028 2.628057 3.54241 1237.235 0.1266081 0.1780993 0.710885 0.4771555 0.9363955

SALMON DE gene_B protein_coding 12.68585 22.78544 36.49626 6.471104 8.401435 6.329655 380.0308 1.764476 0.4549843 3.878104 0.000105274 0.0417309

STAR gene_B protein_coding 6.691679 6.69996 6.80812 6.369242 5.623952 7.263985 1383.059 0.06821578 0.1497678 0.455477 0.6487661 0.970134

DE in STAR but not Salmon

gene_name gene_biotype treated_1 treated_2 treated_3 control_1 control_2 control_3 baseMean log2FoldChange lfcSE stat pvalue padj

STAR_DE gene_C protein_coding 0.3752011 0.270547185 0.3296796 0.068913674 0.09121066 0.110017813 143.84946 1.852127 0.4110528 4.505814 6.61E-06 0.00877797

SALMON gene_C protein_coding 1.264072 1.66924 1.637469 0.3487068 0.4985691 0.7120189 198.6988 1.547013 0.4738074 3.265068 0.001094378 0.149393

STAR_DE gene_D misc_RNA 20.70642 22.31492 35.49361 7.267563 11.09276 7.504352 187.2268 1.603909 0.388531 4.128136 3.66E-05 0.02496008

SALMON gene_D misc_RNA 43.51399 61.40704 76.33679 22.38289 29.24554 33.99015 149.1495 1.080808 0.5071658 2.131074 0.03308309 0.60378

DE in both STAR and Salmon

gene_name gene_biotype treated_1 treated_2 treated_3 control_1 control_2 control_3 baseMean log2FoldChange lfcSE stat pvalue padj

SALMON_DE gene_E lincRNA 146.2584 163.6223 123.8516 51.98941 53.76095 24.2649 511.6784 1.740168 0.3898264 4.463957 8.05E-06 0.00792281

STAR_DE gene_E lincRNA 34.96362 33.74537 32.17363 7.396533 8.938845 4.390962 275.242 2.28813 0.3311117 6.910448 4.83E-12 8.72E-08

SALMON_DE gene_F snoRNA 162.2123 234.9861 251.9385 61.97713 75.17748 55.99916 232.0298 1.749954 0.4163297 4.20329 2.63E-05 0.016431

STAR_DE gene_F snoRNA 39.65257 54.31189 56.14673 12.45087 17.39486 11.41469 244.6545 1.86568 0.3325253 5.610641 2.02E-08 0.000133807

0
Entering edit mode

Are these counts, or CPMs? Are they the raw counts or the normalised counts from DEseq2? I'm assuming they must be bceause of the decimal places, even in the STAR/featureCount counts?

The values in the baseMean column don't correlated with the numbers in the counts:

gene_A counts for salmon are either around 7 or around 20, and the baseMean is 399. gene_D counts for salmon are either around 60 or 30 and the baseMean is 149.

Unless i'm mssing something about how DESeq2 normalision/tximport effect length normalisation works, this seems a bit odd.

0
Entering edit mode

They are FPKM fpkm(dds)

0
Entering edit mode

Can you post the raw counts - thats what DESeq works on.

0
Entering edit mode

Thanks you for your help. Looking at the raw count I'm even more confused.

DE in salmon but not STAR (featureCount raw count)

  treated_1   treated_2   treated_3   control_1   control_2   control_3

SALMON DE   gene_A  237.813 125.654 258.512 634.975 322.316 458.968

STAR    gene_A  1400    1017    1822    1320    874 1185

SALMON DE   gene_B  209.6   198.426 437.987 226.612 327.427 577.47

STAR    gene_B  1627    1179    2024    1402    1150    1147


DE in STAR (featureCount raw count) but not Salmon

          treated_1   treated_2   treated_3   control_1   control_2   control_3

STAR_DE gene_C  58  63  101 259 153 183

SALMON  gene_C  89.88   84.395  261.783 318.113 181.025 203.496

STAR_DE gene_D  95  119 107 222 196 306

SALMON  gene_D  80.831  91.832  150.993 143.221 179.031 224.129


DE in STAR (featureCount raw count) and Salmon

          treated_1   treated_2   treated_3   control_1   control_2   control_3

SALMON_DE   gene_E  303.797 274.08  174.756 783.409 775.681 591.902

STAR_DE gene_E  122 121 79  473 374 350

SALMON_DE   gene_F  111 118 123 263 334 36

STAR_DE gene_F  116 133 116 303 340 345

1
Entering edit mode

At least in some of those you are getting far high counts in that results than in the salmon ones. Can you take a look at the logs coming out of salmon? Particulalry I'd be interesting in seeing a) what library type it had determined and the fraction of reads compatible with that library type, and b) what the fraction of reads aligned is.

Also I noticed that one of these genes is a snoRNA - snoRNA are highly repetitive, and repetitive genes are something that is going to differ between salmon and STAR. Are other discordant genes also repetitive?

0
Entering edit mode

Control_1 - lib_format_counts.json (all are similar)

"expected_format": "ISR", "compatible_fragment_ratio": 1.0, "num_compatible_fragments": 54316218, "num_assigned_fragments": 54316218, "num_frags_with_concordant_consistent_mappings": 49773378, "num_frags_with_inconsistent_or_orphan_mappings": 4718041, "strand_mapping_bias": 0.00000996506721578473, "MSF": 0, "OSF": 0, "ISF": 496, "MSR": 0, "OSR": 0, "ISR": 49773378, "SF": 1896980, "SR": 2820565, "MU": 0, "OU": 0, "IU": 0, "U": 0

The 3 genes in common are 2 lincRNA and 1 snoRNA

The STAR only DE are 8 protein coding, 5 snRNA, 3 antisense, 2 sens_intronic, 2 lincRNA, 1 unprocessed pseudogene, 1 process transcript, and 1 miscRNA

The Salmon only DE are 48 protein coding, 4 lincRNA, 2 antisense, 1 transcribed_unprocessed_pseudogene, 1 transcribed_unitary_pseudogene, 1 snoRNA, and 1 process transcript

So I dont think that repetitivity is an issue

1
Entering edit mode

Well ~25% of your STAR only genes are snRNAs, and there are only, like 9 snRNAs, but they are expressed from 100s of genes: that is, each snRNA is present in multiple copies - i.e. they are repetative.

Also, in normal RNA-seq I would expect to see very little in the way of snRNAs, because they are shorter than the usual minmum fragment length for RNAseq. Same goes for snoRNAs actaully.

One difference between featureCounts and Salmon, is that a read only has to an exon by a few bp to be counted, as long as it doesn't also overlap a different gene, so a read that was mostly intronic would probably get counted by featureCounts, whereas salmon will not count these reads I don't think.

1
Entering edit mode
17 months ago
Rob 4.9k

What is the correlation among expression estimates? If there is little to no true DE, then logFCs will be close to 0 for many genes and correlations will be quite low. Have you tried quantifying with salmon using STAR's mappings (projected to the transcriptome)?

0
Entering edit mode

Thanks. The correlation between the average of the 3 "control" replicates is 0.53 and the correlation between the average of the 3 "treated" replicate is 0.55. Lower than I would have expected. I'm not sure what you mean by "projected to the transcriptome".

0
Entering edit mode

That is still quite low, of course. By "projected to the transcriptome", I mean using STAR's --quantMode TranscriptomeSAM, and getting rid of the sorting by coordinate (basically, the parameters that RSEM uses with STAR as the aligner, except also allowing indels in the alignment). This will at least let you isolate the difference to the mappings or to the quantification model.

0
Entering edit mode

Thank you for your suggestion. The previous correlation were of FPKM. It might not have been ideal. I get the following correlations:

control_salmon treated_salmon control_star_salmonquant treated_star_salmonquant control_star treated_star control_salmon 1.00
treated_salmon 0.99 1.00
control_star_salmonquant 0.89 0.87 1.00
treated_star_salmonquant 0.91 0.91 0.99 1.00
control_star 0.78 0.75 0.92 0.89 1.00
treated_star 0.80 0.79 0.91 0.90 0.99 1.00

FPKM

control_salmon treated_salmon control_star_salmonquant treated_star_salmonquant control_star treated_star control_salmon 1.00
treated_salmon 1.00 1.00
control_star_salmonquant 0.99 0.98 1.00
treated_star_salmonquant 0.99 0.99 1.00 1.00
control_star 0.53 0.52 0.55 0.53 1.00
treated_star 0.55 0.55 0.56 0.55 0.99 1.00

So clearly DEseq2 calculates FPKM differently for featureCount reads. I don't know if it has an influence on DE calculation. Can it be a clue into something else going on?

Furthermore, when looking at DE genes from STAR-mapping --quantMode TranscriptomeSAM and quantifying with salmon quant I get another list of DE genes that show little overlap with the other 2 methods.