Question

Salmon (or other pseudo-mappers) for multi-species RNAseq read filtering

0

Entering edit mode

5 months ago

ian.will ▴ 30

Hello all,

Background: I've inherited a new RNAseq data set and am thinking about updating my approaches (last time I did this I was using HISAT and Cuffdiff). I'd like some opinions on best strategies to disentangle/filter out parasite microbe reads from infected host reads before preforming a differential gene expression analysis on the host. Sample type examples: worm (C. elegans), worm + bacterial parasite, worm + mixed microbiome (+/- parasite) RNA reads

I did initial QC on a couple samples with STAR alignment of trimmed/quality filtered RNAseq reads against a combined-all-organisms concatenated genome fasta (with matching concatenated gtf annotation for transcriptome). I think even in samples that don't have every organism, using the some combined genome fasta for mapping is probably the right move (to keep any mutli-mapping biases and so the same). This worked fine to give me a sense of read abundance and multi-aligning frequency. Microbe reads were present, but too low for robust analysis - so my goal is just to get them out of the way (maybe recording abundance to say something generally about presence of the bugs).

Questions: However, I gather using a pseudo-mapper such as Salmon can be faster, more accurate at the transcript level, and pipes nicely into using DeSeq2 for analysis. It is not clear to me though if the pseudo-mapping approach can let me filter out all reads that map against any microbial contigs/transcripts before moving on to DeSeq2 analysis. I think I just get a quantification of each possible transcript and can't really identify anymore which reads that came from, if I understand Salmon's outputs correctly.

I could remove all microbe transcript rows in the output file based on their ID/names probably, but I have some lingering reservations about multi-mapping and such - but this could just be a suspicion based on ignorance as I'm new to this type of mapping approach.

Would it be necessary to use STAR first, remove reads other than those uniquely aligning to the worm, then either use a STAR output bam or filtered read set in Salmon?

Would a Salmon decoy genome composed of concatenated microbial genomes work for this (with the "target" genome being just worm)?

Even after producing alignments with STAR, using Salmon seems to have some advantages for DEG analysis I think - so it won't be a time saver in the end, but still maybe more accurate?

Thank you for any advice!

STAR Salmon mapper aligner RNAseq • 1.2k views

ADD COMMENT • link 4 months ago by ian.will ▴ 30

2

Entering edit mode

If you are actually interested in filtering/binning the reads they you need to use an aligner that is not using approximations. Try bbsplit.sh from BBTools to which you can provide all genomes at the same time. You are able to decide what to do with the multi-mappers within and across genomes.

If you want to do metagenomic RNAseq analysis (without any read separation) then carry on with salmon. If you don't know which specific contaminants you expect to be present then this would only be an approximation of what you provide as "reference".

ADD REPLY • link 5 months ago by GenoMax 141k

0

Entering edit mode

OK, great thank you. We do know exactly which genomes to expect (the "microbiome" is really just a 7-species cocktail the hosts are inoculated with). I'll take a look into bbsplit. I figure I could accomplish the same thing with STAR+samtools, but if BBTools makes this faster and easier, works for me!

ADD REPLY • link 5 months ago by ian.will ▴ 30

0

Entering edit mode

You can also look at seal.sh which is part of BBTools. It can also do read binning.

ADD REPLY • link 5 months ago by GenoMax 141k

0

Entering edit mode

BBsplit seems to work, although the results are a little different from my STAR-->samtools_idxstats check run I did before. Of course, some difference is to be expected with different tools and settings.

But, I wonder if any of you have some favorite settings for BBsplit when mapping RNAseq reads (paired, 150bp)? Not sure whether the BBsplit or the STAR output might be considered the more accurate program for aligning RNAseq reads across genomes. STAR I suppose was built with transcriptomics in mind, but BBsplit was built for multiple reference genomes and is mighty nice in how it outputs binned read files for me.

There are >600,000 reads fewer mapped by BBsplit (un+ambiguous) on the worm (only 3%, but 600k, if not totally random, could make a difference here). And for the main parasite in the microbe mix, it drops from 40,000 to 11,000 reads with BBsplit (too low anyhow for analysis, but it's a big difference). Otherwise, all the other <1,000 read microbes remain under that threshold with both tools, which is a good enough range for me for such rare transcripts.

I included the maxindel=100000 option as recommended by the docs for RNAseq (presumably to deal with introns). Although, I only have one eukaryote in the mix (which is the focal organism), wondering if there's a clever way to have a reduced the maxindel value applied to the other reference genomes (8 microbes)? Minimum alignment ratio could be something to play with (changing from 0.56), but not totally sure.

example run script: bbsplit.sh ref="list","of","genomes","files" maxindel=100000 ambiguous2=split in="readsR1" in2="readsR2"

(I might take a look at seal.sh if split somehow doesn't fit the bill)

Thoughts if my BBsplit run should specify more options for preparing reads for differential expression analyses? Thanks

ADD REPLY • link 5 months ago by ian.will ▴ 30

0

Entering edit mode

Maybe try the dualrnaseq pipeline in comparison? They just merge everything and split it afterwards again.

ADD REPLY • link 5 months ago by Matthias Zepper 4.5k

0

Entering edit mode

Since you are doing a mix of pro- and eukaryotes it is going to be difficult to find parameters that will work well in all situations. There will always be some ambiguity no matter what tool you use. You probably need to choose an option then move forward with the rest of the analysis.

Unless you have a need to separate the reads into organism specific bins using salmon approach may be equivalent to bbsplit.

ADD REPLY • link 5 months ago by GenoMax 141k

0

Entering edit mode

Yeah, the more I dig in, the more this seems the way. The worm has stellar genomic resources of course, so downloading a target transcripts.fa for Salmon is easy. The microbes have genomes (with gtf annotations) but not a true transcriptome file exactly. So I am going to try mapping to worm transcripts, with a decoy.fa composed of the worm genome + microbes genomes that should shake off any microbial reads - I think. This somewhat parallels one of the run options of that dualrnaseq pipeline, as I understand it.

Thanks to all for the input! (and like, stop me if this is wrong haha)

ADD REPLY • link 5 months ago by ian.will ▴ 30

0

Entering edit mode

but not a true transcriptome file exactly.

Since you have GTF's you can easily create the microbial transcriptome using an utility like gffread (LINK). If you know all component organisms then using their exact transcriptomes is going to give you the best results. Use them as a single pooled reference with salmon.