Question

Does RSEM ignore STAR's splice-aware nature?

20 months ago

bioinfo2345 ▴ 40

I am currently doing an RNA-seq project with differential expression where I am using STAR as an aligner and RSEM for quantification. The project uses a reference genome with a GFF file containing information about the location of transcripts and introns.

From what I have read, RSEM cannot handle gapped alignments.

How much of a problem is this?
Does this mean that the annotation of where introns are, etc., is not used by RSEM? That is, do the extra things that STAR does (since it is splice-aware) not benefit the analysis in the end?
Should I be using a different quantification tool than RSEM to make the most of this annotation information? Are there any robust alternatives that also use EM to handle multi-mapping reads?
Or is it just as good to use, e.g., HTSeq-count?

RSEM STAR • 1.0k views

updated 20 months ago by Rob 6.5k • written 20 months ago by bioinfo2345 ▴ 40

score 1 · Answer 1 · 2022-08-01

RSEM uses reads aligned to the transcriptome, not to the genome. As far as I'm aware, this is true of all alignment-based EM transcript quantification tools (it's definitely true of salmon when quantifying pre-mapped reads). Since the transcriptome contains the sequences of transcripts after introns have been removed, there should be no gapped reads in a transcriptome alignment (as long as the transcriptome annotation is sufficiently correct).

 read sequence: ATGATGATGGGTGGTGGT

 alignment to g: MMMMMMMMMnnnnnnMMMMMMMMM - gapped alignment
 genome seq:     ATGATGATGCGACGAGGTGGTGGT
 transcript:     |>>>>>>>|------|>>>>>>>|
                          \    /
                           \  /
                            \/
                    |>>>>>>>||>>>>>>>|
transcript seq:     ATGATGATGGGTGGTGGT
alignment to trans: MMMMMMMMMMMMMMMMMM - ungapped alignment

Traditionally this is done by aligning with a non-splice-aware aligner to a FASTA file of transcript sequences. However, one of the many cool features of STAR is that it can do spliced alignment of reads to the genome (gapped), and then use the provided GFF to output the coordinates in transcript space (which should now be ungapped), ready for use by RSEM or Salmon etc.
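To make the genome-to-transcript projection concrete, here is a minimal Python sketch of the idea using the toy example from the diagram. It is not STAR's actual implementation: the function name `project_to_transcript`, the 0-based half-open exon intervals, and the restriction to plus-strand transcripts and M/N CIGAR operations are all simplifying assumptions made for illustration.

```python
import re

def project_to_transcript(start, cigar, exons):
    """Project a spliced genome alignment onto a plus-strand transcript.

    Hypothetical helper for illustration only.
    start: 0-based genomic start of the alignment
    cigar: CIGAR string containing only M (match) and N (intron skip)
    exons: sorted list of 0-based, half-open (start, end) exon intervals
    Returns (transcript_start, transcript_cigar), or None if the alignment
    is not compatible with the annotated exon structure.
    """
    # Collect the aligned genomic blocks; N operations only advance the position.
    blocks = []
    pos = start
    for length, op in re.findall(r"(\d+)([MN])", cigar):
        length = int(length)
        if op == "M":
            blocks.append((pos, pos + length))
        pos += length

    # Transcript-space offset at which each exon begins.
    offsets = []
    total = 0
    for exon_start, exon_end in exons:
        offsets.append(total)
        total += exon_end - exon_start

    # Map each aligned block into transcript coordinates.
    segments = []
    for block_start, block_end in blocks:
        for (exon_start, exon_end), offset in zip(exons, offsets):
            if exon_start <= block_start and block_end <= exon_end:
                segments.append((offset + block_start - exon_start,
                                 offset + block_end - exon_start))
                break
        else:
            return None  # block falls in an intron or unannotated region

    # Every gap must correspond exactly to an annotated intron, so the
    # blocks have to be contiguous once the introns are removed.
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if prev_end != next_start:
            return None

    t_start = segments[0][0]
    t_length = segments[-1][1] - t_start
    return t_start, f"{t_length}M"


# Toy gene from the diagram: two 9-base exons separated by a 6-base intron.
exons = [(0, 9), (15, 24)]
print(project_to_transcript(0, "9M6N9M", exons))  # gapped -> ungapped
print(project_to_transcript(0, "9M5N9M", exons))  # junction not in annotation
```

The first call turns the gapped genome alignment `9M6N9M` into an ungapped `18M` alignment at transcript position 0; the second returns `None` because its junction does not match the annotated intron. In practice you would not write this yourself: STAR produces the transcript-space alignments when run with `--quantMode TranscriptomeSAM`.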