Salmon: Optimal k-mer size (for indexing) for RNA-seq data alignment using reference genome
Entering edit mode
5.2 years ago
BCArg ▴ 70

Dear all,

I am using salmon to evaluate how L. donovani gene expression varies between two different growth phases (promastigotes and amastigotes). For that I downloaded two RNA-seq datasets from SRA and aligned them to the reference L. donovani coding sequences (CDS).

In order to quantify gene expression with salmon I have to index the CDS using a specific k-mer value. In the salmon manual itself they use a 31-mer for 75 bp reads. My question is whether it is reasonable to choose this same ratio, i.e. 0.413 × read length, or if there is any other advisable value. I have heard that some people use about 0.33 × read length, and I also found this post where values between 0.5 and 0.66 × read length are mentioned (although that is rather for de novo genome assembly, which is not my case, as I am mapping to a reference CDS).
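As a quick illustration of those rules of thumb (the round-to-odd convention and the cap at salmon's maximum of 31 are my own assumptions for this sketch, not anything salmon enforces):

```python
def candidate_k(read_len, ratio, k_max=31):
    """Scale read length by a rule-of-thumb ratio, round to the nearest
    odd value, and cap at salmon's maximum supported k of 31."""
    k = round(read_len * ratio)
    if k % 2 == 0:
        k -= 1  # k-mer sizes are conventionally odd
    return min(k, k_max)

# The 31-mers-for-75bp example from the salmon manual (ratio ~0.413):
print(candidate_k(75, 0.413))  # 31
# The same ratios applied to shorter 36 bp reads:
print(candidate_k(36, 0.413))  # 15
print(candidate_k(36, 0.5))    # 17
```

This is only arithmetic on the ratios quoted above; as the answer below notes, salmon's mapping is fairly robust to the exact choice of k.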

RNA-Seq alignment salmon k-mer indexing • 5.8k views
Entering edit mode
5.2 years ago
Rob 4.9k

Hi BCArg,

The mapping approach used in salmon is fairly robust to the specific choice of k. The default value of 31 should probably work well for your data. You can't make it larger, as 31 is the maximum supported value (it also doesn't make much sense to make it larger, as salmon's use of k-mers is very different from how k-mers are used in de novo assemblers, since it will also consider arbitrarily long matches beginning with a given k-mer by matching them in the suffix array).

If you are concerned that the k-mer size is too large (which is likely not the case), you could always build a separate index with a smaller k-mer size, and see how that affects the reported mapping rate. Salmon reports the mapping rate on the console during a run, and this value is also recorded in the directory <quantification_dir>/aux/meta_info.json, where <quantification_dir> is whichever directory you pass to salmon via the -o flag.
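A minimal sketch of reading that value programmatically when comparing indices, assuming the aux/meta_info.json location described above and the percent_mapped key used by recent salmon versions (the demo directory and file below are fabricated stand-ins for real salmon output):

```python
import json
import os

def mapping_rate(quant_dir):
    """Return the mapping rate salmon records in <quant_dir>/aux/meta_info.json."""
    with open(os.path.join(quant_dir, "aux", "meta_info.json")) as fh:
        return json.load(fh)["percent_mapped"]

# Demo with a fabricated quantification directory (stand-in for a real run):
os.makedirs(os.path.join("quant_k17", "aux"), exist_ok=True)
with open(os.path.join("quant_k17", "aux", "meta_info.json"), "w") as fh:
    json.dump({"percent_mapped": 11.57}, fh)

print(mapping_rate("quant_k17"))  # 11.57
```

With one quantification directory per index, this makes it easy to tabulate mapping rates across several choices of k.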

Entering edit mode

That's what I've been doing: playing around with different k-mer values and seeing how they affect my gene expression estimates. I forgot to mention that my read length is 36 bp. Thanks for the reply!

Entering edit mode

Ahhh --- yes; with 36bp reads you likely do want to use a smaller k (perhaps 19 or 17). The default is optimized for the most common read lengths (typically >76bp).

Entering edit mode

Hi again Rob, I have mapped my reads using three different k-mer sizes, namely 11, 17 and 23 (read length = 36 bp). The mapping rates I got (for the same RNA-seq reads) with these k-mers were 1.22%, 11.57% and 4.81%, respectively. Can I say that the higher the mapping rate the better? Therefore, in this case, should I use the 17-mer?

Entering edit mode

Generally, yes. A higher mapping rate means that you're accounting for more of the data, and are likely to get better / more complete results. For 36 bp reads, 17-mers seem like a reasonable size on which to index. I don't know much about L. donovani, but the overall mapping rates seem very low. Is that common in this organism? There are other considerations that could affect the mapping rate (e.g. if the underlying library was not depleted of ribosomal RNA, there could be a considerable fraction of reads coming from rRNA, etc.). Salmon is generally sensitive when mapping, so I'm guessing the low mapping rate is the result of the true sequences of origin not being present in your set of target transcripts.
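Using the rates reported above purely as an illustration, "pick the k with the highest mapping rate" is just:

```python
# Mapping rates (percent) reported above for 36 bp reads, keyed by k-mer size
rates = {11: 1.22, 17: 11.57, 23: 4.81}

# Choose the index whose k gave the highest mapping rate
best_k = max(rates, key=rates.get)
print(best_k)  # 17
```

This selection criterion comes with the caveat above: a uniformly low rate usually points at the library or the reference, not at the choice of k.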

Entering edit mode

I believe it was indeed the quality of my library. I visualised the alignment with IGV and I could clearly see that only a small portion of the reads mapped to the reference genome.

Entering edit mode

I have exactly the same problem: short read length (36 bp) and very low mapping rates (~9.5%). I used cDNA to generate my index file and ran salmon in quasi-mapping mode.
For the low mapping rates, there were suggestions that I should include rRNA and build the index from a GTF file (from the thread "Salmon very low mapping": "Salmon will assign them an abundance of 0 (or very close to 0), but in those samples where you have a higher fraction of rRNA, they will show higher abundance. If you are building your transcript set (for indexing) from a gtf file, you can include these transcripts based on their biotype."). Do you have any update on this, BCArg?

Entering edit mode

An update on my data: I tried setting k to 13, 17 and 19. For k=13 I got only a 5.6% mapping rate; for k=17, ~49%; for k=19, ~35%. I also tried excluding ncRNA from the mapping: with k=17 the mapping rate did not change much, still ~49%. Note that for both k=17 and k=19 the hits per fragment are ~2.3, whereas for k=13 the hits per fragment are only ~0.2. I will try running DESeq2 with these three settings to see whether they make a big difference in terms of DEGs.