Best Tool for Fast RNA-Seq Transcriptome Alignment to Extract Unmapped Reads
3 months ago
aboll ▴ 20

Hello all,

I have RNA-Seq data from human cancer samples. I ultimately want to single out reads that do not align properly (i.e., don't align or are discordant/split-mapped/etc.) to the human reference for downstream analysis. I could do this by mapping to the genome with STAR, which is what we typically do for alignment. However, I want to find the quickest/least expensive way. The idea is to first align to the transcriptome and keep only the unaligned/ambiguous reads for further analysis. I can then align these to the genome with STAR, and my guess is this would be cheaper/faster than aligning all of them to the genome to begin with. Please correct me if this is obviously wrong.

Now my question is what the best approach is for the initial transcriptome mapping. Based on comparing several tools (STAR, Bowtie2, Hisat2, kallisto), I think using Bowtie2 to map reads to human transcripts is the best option. STAR is generally not used for mapping to the transcriptome and has higher memory usage. Kallisto pseudo-alignment is fast, but it wouldn't give me the results I want, which are the unaligned reads. Hisat2 seems comparable to Bowtie2, but in this case I don't need the alignment to be splice-aware, and it seems like Hisat2 is not as well maintained. Please let me know if you have done something similar or have any thoughts on this. Thanks!

mapping rna-seq • 974 views
3 months ago
GenoMax 153k

"my guess is this would be cheaper/faster than just aligning all of them to the genome to begin with"

If you are paying for all of your compute (e.g. CPU/storage) then possibly, but it should not be a big difference. Since you don't want to do the analysis again and again, spending a bit more (in money and time) upfront would be advisable. Aligning the data to the genome will take care of all expressed reads, even those that may not be represented in the transcriptome (previously unknown transcripts).

You can then filter the aligned file with samtools using the answer here: samtools extract unmapped reads
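For context on what that samtools filter is doing: `samtools view -f 4` keeps records whose SAM FLAG has bit 0x4 (read unmapped) set, and `-f 12` requires both the read and its mate to be unmapped. The sketch below is only an illustration of how the relevant FLAG bits decode, using plain shell arithmetic; the function name `classify_flag` and the category labels are made up for this example, and the order in which the bits are checked is an arbitrary choice.

```shell
# Classify a SAM FLAG value by the bits that matter for "not properly aligned".
# Bit meanings follow the SAM specification; the labels are hypothetical.
classify_flag() {
    flag=$1
    if [ $(( flag & 4 )) -ne 0 ]; then
        echo unmapped          # 0x4: the read itself is unmapped
    elif [ $(( flag & 2048 )) -ne 0 ]; then
        echo supplementary     # 0x800: supplementary record of a split alignment
    elif [ $(( flag & 1 )) -ne 0 ] && [ $(( flag & 8 )) -ne 0 ]; then
        echo mate_unmapped     # 0x1 and 0x8: paired, and the mate is unmapped
    elif [ $(( flag & 1 )) -ne 0 ] && [ $(( flag & 2 )) -eq 0 ]; then
        echo improper_pair     # 0x1 set, 0x2 clear: paired but not a proper pair
    else
        echo proper
    fi
}

classify_flag 77   # paired, read unmapped, mate unmapped, first in pair
classify_flag 99   # paired, proper pair, mate reverse, first in pair
```

Reads in any category other than `proper` are roughly the ones the original question is after, although how an aligner sets the proper-pair bit depends on its own settings.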


If you are using salmon then you could use the --writeUnmappedNames parameter to get the read names that don't map (ref: https://salmon.readthedocs.io/en/latest/salmon.html#writeunmappednames). You can then use filterbyname.sh from BBMap suite to get those reads out of the original data.
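If BBMap is not at hand, the same second pass can be sketched with awk. This is a minimal illustration, not a replacement for filterbyname.sh: it assumes an uncompressed, strictly 4-line-per-record FASTQ, and it assumes the names file has the read name as the first whitespace-separated field on each line (as far as I know, salmon writes the name followed by a short flag, in `aux_info/unmapped_names.txt` under the output directory). The function name `extract_reads_by_name` is hypothetical.

```shell
# extract_reads_by_name NAMES FASTQ
# Print the FASTQ records whose read name (first token of the header line,
# without the leading "@") appears in the first column of NAMES.
extract_reads_by_name() {
    awk '
        # First file: remember every read name listed in column 1.
        FILENAME == ARGV[1] { keep[$1] = 1; next }
        # Second file: on each header line (every 4th line starting at 1),
        # decide whether the whole record should be printed.
        FNR % 4 == 1 { printing = (substr($1, 2) in keep) }
        printing
    ' "$1" "$2"
}
```

For paired-end data this would be run once per mate file, for example `extract_reads_by_name unmapped_names.txt R1.fq > R1_unmapped.fq`. Gzipped input would need to be decompressed first.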


If you are open to using bbmap.sh, the aligner from the BBMap suite, then you can capture reads that do not align at the time you do the mapping. You could do something like the command below (increase -Xmx10g to -Xmx30g if you are using a human-size genome). You will need to have samtools available in $PATH to get the BAM file.

$ bbmap.sh threads=N -Xmx10g in1=R1.fq.gz in2=R2.fq.gz outu1=R1_unmapped.fq.gz outu2=R2_unmapped.fq.gz out=aligned.bam ref=reference_genome.fa ambig=random

If you do not want to get the aligned data then simply remove out=aligned.bam from the command line.


Thanks for your response! I understand the cost difference may not be much... I am considering comparing this approach to just aligning to the genome with STAR to see if we get anything better than a negligible improvement. This is interesting, I didn't realize salmon wrote out the unmapped read names, thanks. I'll look into salmon/bbmap.

3 months ago
dsull ★ 7.7k

Honestly, I'd just stick with STAR for this purpose. You can use the --outReadsUnmapped option to directly output the unmapped reads in FASTQ format or the --outSAMunmapped to directly output the unmapped reads in SAM format. STAR also has an option to create a sparse genome index if memory is a concern.

You can use lightweight tools like kallisto or salmon, but you have to actually go through the FASTQ file TWICE in order to get your unmapped reads (you go through it once for the pseudoalignment, and then you go through it a second time to extract the reads that were not pseudoaligned); this will therefore be much less efficient. I actively develop kallisto, so if you want, I'm happy to consider outputting the unmapped reads in FASTQ format as a feature in a future release of kallisto.

bowtie2-to-transcriptome will probably be more efficient than STAR. However, I've tried using bowtie2 to output unaligned reads in the past, and I remember having issues in paired-end mode: going from a paired-end input FASTQ to a paired-end output unmapped FASTQ (I don't think it's possible to do this directly in bowtie2).
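One thing that may be worth testing here: the bowtie2 manual documents `--un-conc` (and a gzipped variant, `--un-conc-gz`), which writes pairs that fail to align concordantly to a pair of files, with `%` in the path replaced by 1 or 2. Whether that sidesteps the paired-end problem described above is not something this thread confirms, so treat the following as an untested sketch. The function name `unaligned_pairs`, the output naming, and the `BOWTIE2` and `THREADS` environment variables are all hypothetical conveniences.

```shell
# unaligned_pairs INDEX R1 R2 PREFIX
# Align pairs to a (transcriptome) index and keep only the pairs that did not
# align concordantly; the alignments themselves are discarded.
# BOWTIE2 can point to a specific bowtie2 binary; THREADS defaults to 4.
unaligned_pairs() {
    index=$1; r1=$2; r2=$3; prefix=$4
    "${BOWTIE2:-bowtie2}" -p "${THREADS:-4}" -x "$index" -1 "$r1" -2 "$r2" \
        --un-conc-gz "${prefix}_unconc_R%.fq.gz" -S /dev/null
}
```

Called as `unaligned_pairs transcriptome_index R1.fq.gz R2.fq.gz sample`, this should leave `sample_unconc_R1.fq.gz` and `sample_unconc_R2.fq.gz`, which could then go to STAR for genome alignment.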


Thanks for your input. I read STAR is generally not meant to align to the transcriptome and there would be other more optimized ways to do this. Maybe your sparse genome index suggestion would mitigate this.

For kallisto, that sounds like it would be helpful! I would be interested to try this out and incorporate it if this ever becomes available.

Thanks for the heads up about using bowtie2 for getting unaligned reads. I see what you're saying. I'll check it out and see if I have issues since we do generally have paired-end data.


Hey @delaneyksull. Following up on this, it looks like kallisto has a --fusion option in older versions (before version 0.50.0). This looks similar to something I would want to implement. Is there a reason this is no longer maintained in newer versions of kallisto?


When we did a huge revision of the kallisto source code, we found it too cumbersome to maintain the --fusion option.

However, we have plans to restore it in the next release of kallisto (we actually already wrote most of the code to do so).
