Salmon Quantification for RNA-seq Read Pairs with Different Lengths
2
5
Entering edit mode
3.0 years ago
ATpoint 53k

We have some older published RNA-seq data that we would like to quantify with Salmon, but paired-end read lengths differ quiet a bit, ranging from 2x100bp, over 75/25bp, 50/50bp, 35/35bp and some other combinations. Should one use the same index for all files, so like -k=19 (to handle the 35bp reads), or rather several ones, like -k=31 for the longer reads, and -k=19 for for the shorter reads? Would it induce a kind of mappability/quantification bias if one uses different indices for different files, especially if different runs (lane replicates) belong to the same patient? Another possibility would be to trim all reads to a fixed read length, which in this case would be the shortest read length in the whole dataset (25bp). Your opinions would be appreciated.

EDIT: Following up on this, I quantified four different samples (varying read lengths, x-axis) with salmon, default settings and --validatemappings against the human gencode_v28 (varying kmer length, legend) and saw that the kmer length plays a very minor role in the mapping percentage (not discussing now if these samples are good or bad towards quality):

As a comparison, these samples, mapped with HISAT2, defaults, against hg38, have the following mapping rates,

(%mapped,%properly-paired):

36-36 (75%, 63%) 51-51(76%, 69%) 75-75 (98%, 95%) 76-26 (43%, 0.06%)

The kmer length has a little influence on the mapping rate. Therefore, (as most samples I have have either 50 or 76bp reads) I will probably use something like -k=25 for all samples, categorically discard those that have either a fwd or rev read length of 26bp, and indicate the read length in the DE design.

RNA-seq Salmon Quantification • 2.6k views
0
Entering edit mode

tagging: Rob

4
Entering edit mode
3.0 years ago
Rob 4.9k

Hi ATpoint,

I would probably recommend going with the default kmer length down to the case where the shorter of the reads is 75bp, and then a smaller kmer size after that. As h.mon suggests, It's probably a good idea to include read length in your de design matrix anyway, as it could cause a batch effect even with traditional alignment.

I'd also probably recommend grabbing a recent salmon and running with the --validateMappings flag, as this will improve the sensitivity and specificity of mappings, and I would expect this to help even more (relatively) with shorter reads. I would definitely not trim data to the shortest length (25bp), as you're much more likely to lose a lot of signal that way than to save yourself from creating potentially addressable batch effects.

0
Entering edit mode

Thank you Rob for the suggestions. I updated my post with some mapping percentages I obtained from samples with different read lengths with kmer sizes from 17 to 31. Kmer length seems to have quiet a limited impact, and the actual read length is the factor that mostly determines the mapping/quantification rate (not discussing now the influence of library prep and contaminations etc).

1
Entering edit mode

Hi ATpoint,

Thanks for the detailed follow-up! It's good to see these empirical results, since it's my hope that the effect of the k-mer size wouldn't be too drastic in general. After all, the k-mer is used only as a seed, and the full matching is done against the suffix array. Also, the --validateMappings flag reduces the stringency to consider a putative mapping (but then validates these using an actual alignment score, hence the name), and so should increase the sensitivity in such cases.

3
Entering edit mode
3.0 years ago
h.mon 33k

I will use mapping in a loose sense here, as Salmon quasi-maps.

In addition to your kmers question, remember the "mappability" of reads of different length varies: the longer the read, the more uniquely "mappable" it is - so you already have an effect due to length alone, and specifically speaking, longer reads are more accurate. So trimming all datasets to the shortest read length will decrease overall precision of count estimates.

Decreasing kmer lengths of shorter reads would probably increase their mapping rate (and also multi-mapping rate), but I wonder how much of the increase would be correct quantification and how much would be just noise.

If you have a good number of datasets over all read lengths, you can add read length as a factor in the differential gene expression analysis model.

Would it induce a kind of mappability/quantification bias if one uses different indices for different files, especially if different runs (lane replicates) belong to the same patient?

Do you mean the same library preparation has been sequenced more than once with different read lengths? If so, you can explore the effect of read length on quantification with this data, as these are technical replicates with only read length differing.