Question: Can Salmon's quantification method accommodate concatenated reads of varying length (from independent sequencers) in a single fastq file?
c_dampier (USA/UVA) wrote 7 months ago:

Good evening

My question is whether it is more appropriate to feed Salmon a single concatenated fastq file or multiple sequencer- and read-length-specific fastq files when the reads in the fastq file (or files) have been generated at different times with different sequencing lengths for a given sample. The Salmon documentation is clear enough regarding Salmon's ability to accommodate concatenated fastq files from a single library, but I'm concerned about the effect of varying read lengths on the quantification process.

My motivation for this question is that I have a dataset generated over several years wherein certain samples with insufficient read depth were sequenced multiple times, and the different sets of reads were concatenated into single, sample-specific fastq files. I could load the single, concatenated fastq file for a given sample into Salmon, or I could decompose the fastq file for a given sample into multiple sequencer- and read-length-specific fastq files, and then load them separately into Salmon. (I could also decompose them and then load them together into Salmon using the referenced multiple read file approach, but I will resist that temptation.) My concern with the first (single file) approach is that Salmon would apply a quantification scheme to all reads that is only applicable to a subset of the reads. My concern with the second (multiple file) approach is the converse; multiple schemes will be applied when a single scheme would be more appropriate.

If I use the first (single file) approach, I think I should at least shuffle the reads (per read order section). If I use the second (multiple file) approach, should I use the same or different indices (with different k values most appropriate for read length)? I am using Salmon in non-alignment-based mode with a quasi-mapping-based index.
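For readers who want to try the decomposition described above, here is a minimal sketch of splitting a concatenated FASTQ into groups by read length. It assumes an uncompressed file with strict four-line records and untrimmed reads, so that read length is a usable proxy for the sequencing run; if reads were trimmed, length will not separate runs cleanly. The function names (`read_fastq`, `group_by_read_length`, `write_fastq`) and the toy reads are hypothetical and are not part of Salmon.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, quality).
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from an uncompressed, strictly 4-line-per-record FASTQ."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError(f"truncated record after {header!r}")
        seq, sep, qual = (s.rstrip("\r\n") for s in rest)
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record at {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at {header!r}")
        yield header, seq, sep, qual


def group_by_read_length(records: Iterable[Record]) -> Dict[int, List[Record]]:
    """Group records by sequence length (assumed here to identify the run)."""
    groups: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        groups[len(record[1])].append(record)
    return dict(groups)


def write_fastq(records: Iterable[Record], path: str) -> None:
    """Write records back out as 4-line FASTQ."""
    with open(path, "w") as out:
        for record in records:
            out.write("\n".join(record) + "\n")


if __name__ == "__main__":
    # Toy input standing in for a concatenated sample file (hypothetical reads).
    example = [
        "@run1_read1", "ACGTACGT", "+", "IIIIIIII",
        "@run2_read1", "ACGTACGTACGT", "+", "IIIIIIIIIIII",
        "@run1_read2", "TTGGCCAA", "+", "IIIIIIII",
    ]
    groups = group_by_read_length(read_fastq(example))
    for length, recs in sorted(groups.items()):
        print(length, len(recs))
```

This version keeps every record in memory, which is fine for a toy example but not for a real sample; for full-size files one would more likely stream each record straight to a per-length output handle. For paired-end data both mate files would need to be split in step so that pairs stay together.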

Tags: sequencing, rna-seq, salmon

I would quantify every run separately and then run a couple of diagnostics (PCA, correlations) to make sure there are no confounding effects due to sequencing machine or center. In my experience k-mer length is not much of a factor: the mapping rate will only change slightly (see e.g. Salmon Quantification for RNA-seq Read Pairs with Different Lengths), provided that read length >= k-mer length. I would use the same k-mer length for all files.
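As a rough illustration of the correlation check suggested above, the sketch below compares two per-run Salmon results by computing the Pearson correlation of log2(TPM + 1) over shared transcripts. It assumes each run was quantified separately and that its `quant.sf` is the usual tab-separated table with `Name` and `TPM` columns; the helper names and the toy values are hypothetical, and the in-memory strings merely stand in for real files.

```python
import csv
import io
import math
from typing import Dict, Sequence, TextIO


def read_tpm(handle: TextIO) -> Dict[str, float]:
    """Read transcript name -> TPM from a quant.sf-style tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def log_tpm_correlation(run_a: Dict[str, float], run_b: Dict[str, float]) -> float:
    """Correlate log2(TPM + 1) over the transcripts present in both runs."""
    shared = sorted(set(run_a) & set(run_b))
    log_a = [math.log2(run_a[t] + 1.0) for t in shared]
    log_b = [math.log2(run_b[t] + 1.0) for t in shared]
    return pearson(log_a, log_b)


if __name__ == "__main__":
    # Toy tables standing in for two runs of the same sample (made-up values).
    header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    run1 = io.StringIO(header + "t1\t1000\t800\t10.0\t50\n"
                                "t2\t2000\t1800\t0.0\t0\n"
                                "t3\t1500\t1300\t30.0\t120\n")
    run2 = io.StringIO(header + "t1\t1000\t850\t12.0\t70\n"
                                "t2\t2000\t1850\t1.0\t3\n"
                                "t3\t1500\t1350\t27.0\t140\n")
    print(log_tpm_correlation(read_tpm(run1), read_tpm(run2)))
```

With real data one would open each run's `quant.sf` in place of the `io.StringIO` objects. Runs of the same sample that correlate noticeably worse than other replicate pairs would be a hint that machine or read length is acting as a confounder; a PCA across all runs gives a complementary view but needs a numerical library.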

written 7 months ago by ATpoint

Thank you, ATpoint. Your comments are helpful, as is the link to the very informative discussion on read pairs with different lengths. I regret not finding it when I initially started searching for questions related to mine. Thank you for the link.

written 7 months ago by c_dampier