Question

BWA-MEM versus Salmon - how to interpret large differences between read outputs

0

Entering edit mode

4.5 years ago

gmchaput ▴ 10

I am analyzing differential gene expression for a plant based on 3'cDNA tag libraries generated from total RNA. This was started previously by another researcher where they used FASTX Toolkit for quality/trimming, BWA-MEM + HT-seq count, and then DESeq2 (back in 2015/2016).

I decided to go back and used the raw fastq files from the sequencing facility and instead use SortMeRNA + trimmomatic for quality/trimming, and then Salmon + tximport before using DESeq2.

What shocked me was the reads that mapped depending on the method. I get ~4 million reads that map via Salmon whereas the previous researcher had ~8 millions reads that mapped via BWA-MEM.

To troubleshoot, I used the FASTX Toolkit trimmed reads from the previous researcher and ran it through Salmon in case it was a filtering issue. I still had ~4 million reads compared to the ~8 million of these same files running through BWA-MEM.

I knew the two methods are very different and to expect some differences but by half causes much larger differences in the downstream results from DESeq2. Can someone lead me in the right direction with an explanation for this large of a difference or potentially what I am doing wrong (or maybe I am not doing something wrong?).

Not sure exactly what additional info people may need but I will be happy to elaborate if necessary!

BWA-MEM Salmon RNA-Seq alignment DESeq2 • 1.8k views

ADD COMMENT • link updated 4.5 years ago by ATpoint 81k • written 4.5 years ago by gmchaput ▴ 10

score 2 · Answer 1 · 2019-11-01

2

Entering edit mode

4.5 years ago

ATpoint 81k

Total mapping percentages are not informative as bwa maps against the genome and salmon quantifies against the transcriptome. You should compare the reads mapped / overlapping exons with the salmon quants. Check the distribution of aligned reads e.g. with qualimap to see if they align to exons or somewhere off-target which could indicate rRNA or genomic-DNA contamination. Also, bwa is not recommended for RNA-seq as it is not splice aware. Use something like STAR or Hisat2 for this to properly handle reads spanning exon/exon junctions. I would also not use fastx anymore as it is old, afaik not maintained and should be replaced by more recent tools like fastp, cutadapt or trimmomatic.

ADD COMMENT • link 4.5 years ago by ATpoint 81k

0

Entering edit mode

Thank you! So to make sure I understand your suggestion correctly: I should take the output from BWA-MEM as well as the Salmon output and run them through qualimap to see if there is of target mapping?

I agree and thought the same thing about BWA and FASTX, which is why I went with SortMeRNA+trimmomatic and Salmom; however, I needed to justify my reasonings to my advisor- so thank you for the explanation about BWA not being splice aware. Is it ok to use Salmon, or would you recommend STAR or Hisat2 instead?

ADD REPLY • link 4.5 years ago by gmchaput ▴ 10

2

Entering edit mode

~~Exclude BWA mem, it is not an RNA-seq aligner.~~ Edit 02.11.19: Just realized that this is 3'-end cDNA libraries, so splicing is probably not too much of a factor, not sure about this. Sorry if causing confusion. I still would use an RNA-seq mapper to be sure. /Edit_end/.

Choice between salmon (or other tools like kallisto) and traditional aligners are (from what I understand) active fields of research, but for the end-user a matter of preference (at least for me). I use salmon because it is fast and convenient (plus published, maintained, and accepted) and can be seamlessly integrated with tximport to summarize the abundance estimates to the gene level. Use what you feel comfortable with, but if you choose traditional alignments, use splice-aware mappers. Qualimap (among other tools) can analyze how many reads actually overlap with exons (so on-target) or other genomic regions (off-targets, indicating contamination or library prep. issues).

Edit-2: For a more detailed discussion see also this very recent preprint on exactly that matter from the group that develops salmon: https://www.biorxiv.org/content/10.1101/657874v2

ADD REPLY • link 4.5 years ago by ATpoint 81k