HISAT2 end-to-end and --no-spliced-alignment mode returns gapped alignments
2
1
Entering edit mode
5.2 years ago

I ran HISAT2 (index built using a transcriptome multi fasta) intending that it won't perform gapped alignment. I use following script to run HISAT:

INDEX=./indices/hisat/transcriptome
FASTQ=1 OUTPUT=./transcriptome_aligned/2.sam
./software/hisat-0.1.6-beta/hisat \
-q \
-p 2 \
--no-spliced-alignment \
--end-to-end \
-x $INDEX \ -U$FASTQ \
-S OUTPUT  Should I still expect gapped alignment in my SAM file? I have records like this in the SAM output. SRR2144041.255 0 YCL025C 274 255 16M1I33M * 0 0 CAGGCTCAAGAACTAGAAAAAAAATGAAAGTTCGGACAACATAGGCGCTA CCCFFFFFHHHHHJJJJJJJJJJJJJJJIJGIIIIJJJJJIJJIIJJJHH AS:i:-8 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:49 YT:Z:UU NH:i:1  This shows that HISAT2 is still performing gapped alignment even with --end-to-end and --no-splice-alignment parameters. I'm trying to use the output SAM for rsem-calculate-expression but it returns following error due to presence of gapped alignment: rsem-parse-alignments ./indices/rsem/rsem ./rsem_output/sample.temp/sample ./rsem_output/sample.stat/sample ./transcriptome_aligned/sample.bam 1 -tag XM Read SRR2144041.836747: RSEM currently does not support gapped alignments, sorry! "rsem-parse-alignments ./indices/rsem/rsem ./rsem_output/sample.temp/sample ./rsem_output/sample.stat/sample ./transcriptome_aligned/sample.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!  How do I make sure that HISAT2 doesn't perform gapped alignment? Should I filter the output for using grep -v XO:i:0? EDIT: I checked RSEM manual and found that in order to avoid gapped alignments using Bowtie2, RSEM uses following Bowtie2 parameters: --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1  I wonder what is the equivalent of --gbar in HISAT2 Thanks RNA-Seq hisat samtools rsem • 5.0k views ADD COMMENT 5 Entering edit mode 5.2 years ago Solved it: Included following two parameters in HISAT2 run command to make gap opening and extension insanely difficult. --rdg 10000,10000 --rfg 10000,10000  My HISAT2 script now looks like this: INDEX=./indices/hisat/transcriptome FASTQ=1
OUTPUT=./transcriptome_aligned/2.sam ./software/hisat-0.1.6-beta/hisat \ -q \ -p 2 \ --no-spliced-alignment \ --end-to-end \ --rdg 10000,10000 \ --rfg 10000,10000 \ -xINDEX \
-U $FASTQ \ -S$OUTPUT


Hope it helps, if anybody is trying to run RSEM downstream of HISAT2.

This solution might not be foolproof. Improvements on this answer are invited.

4
Entering edit mode
5.2 years ago

First I would suggest that you make certain that you really want to do that. Gapped alignments indicate that your measured genome has insertions and deletions relative to the reference, be that genomic or transcriptomic. That is very common and is to be expected. Forcing a read to align elsewhere would be most likely be the wrong choice!

If you really want to turn off gaps then raise the cost of gap open and that should do it.

If a tool cannot handle gapped alignments then it is an obsolete and wrong tool that should not be used! Suggesting parameters like the one you read just to avoid gaps IMHO is terrible practice. Nothing good will come from that.

0
Entering edit mode

But RSEM is one of the revered tools for isoform quantification. If exclusion of gapped alignment is such a serious issue then, why has it not been highlighted in any literature for gene quantification.

In fact newer tools like Kalliso used RSEM for most of benchmarking.

2
Entering edit mode

While I probably would not go so far is to claim that RSEM is obsolete (i.e. that it should no longer be used), I would note that quite a few tools have incorporated gapped alignments, and have benefitted as a result. Some of them even compare directly against RSEM and demonstrate marked improvement (see e.g. TIGAR — it gets some boost by considering an improved alignment model allowing gaps, and an even bigger boost by adopting the Variational Bayesian EM algorithm). The default alignment arguments for RSEM also discards orphaned alignments, though you probably wouldn't use that, alone, as an argument that it's the right thing to discard orphaned reads, right?

1
Entering edit mode

I've never realized that RSEM cannot handle gapped alignments and I say this with no disrespect - but a tool that cannot use make use of gapped alignment can't be all that accurate.

This has played out in the past the same way. Remember how bowtie1 was superfast and brought so much excitement into the field but would not perform gapped alignments. Try publishing anything with this tool today you shouldn't be able to, today we understand much better just how much variation is in the genome.

0
Entering edit mode

Well, I certainly agree that it is throwing out useful information. I was similarly surprised by the fact that it discards orphaned alignments. I think that, in part, other difficulties inherent in accurately estimating transcript-level abundance may dominate the negative performance impact RSEM incurs by ignoring gapped and orphaned alignments. However, as we continue to get a better handle on RNA-seq data, develop more accurate modes and better optimization algorithms, incorporating this information will become increasingly important (e.g. the alignment-based mode of Salmon accounts for such alignments rather than discards them, and we find that this does improve quantification accuracy, at least under certain important metrics).