Can Salmon's quantification method accommodate concatenated reads of varying length (from independent sequencers) in a single fastq file?
5.1 years ago
c_dampier ▴ 60

Good evening

My question is whether it is more appropriate to feed Salmon a single concatenated fastq file or multiple sequencer- and read-length-specific fastq files when the reads in the fastq file (or files) have been generated at different times with different sequencing lengths for a given sample. The Salmon documentation is clear enough regarding Salmon's ability to accommodate concatenated fastq files from a single library, but I'm concerned about the effect of varying read lengths on the quantification process.

My motivation for this question is that I have a dataset generated over several years wherein certain samples with insufficient read depth were sequenced multiple times, and the different sets of reads were concatenated into single, sample-specific fastq files. I could load the single, concatenated fastq file for a given sample into Salmon, or I could decompose the fastq file for a given sample into multiple sequencer- and read-length-specific fastq files, and then load them separately into Salmon. (I could also decompose them and then load them together into Salmon using the referenced multiple read file approach, but I will resist that temptation.) My concern with the first (single file) approach is that Salmon would apply a quantification scheme to all reads that is only applicable to a subset of the reads. My concern with the second (multiple file) approach is the converse; multiple schemes will be applied when a single scheme would be more appropriate.

If I use the first (single file) approach, I think I should at least shuffle the reads (per the read order section of the documentation). If I use the second (multiple file) approach, should I use the same index for every file or different indices (each with the k value most appropriate for that file's read length)? I am using Salmon in non-alignment-based mode with a quasi-mapping-based index.
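For reference, the decomposition step described above could be sketched roughly as follows. This is only a minimal illustration, not part of Salmon: the function name `split_fastq_by_length` and the output naming `reads_len<N>.fastq` are hypothetical, and it assumes single-end, four-line FASTQ records in which read length is a reliable proxy for the sequencing run (that is, the reads have not been trimmed). If the reads were trimmed, splitting on the instrument or run ID in the header line would probably be the safer key.

```python
import gzip
from collections import defaultdict
from pathlib import Path


def split_fastq_by_length(fastq_path, out_dir):
    """Write one FASTQ per observed read length; return {length: read_count}.

    Hypothetical helper: assumes 4-line records and untrimmed reads, so that
    each distinct length corresponds to one sequencing run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    handles = {}  # read length -> open output file
    counts = defaultdict(int)
    try:
        with opener(fastq_path, "rt") as fq:
            while True:
                # A FASTQ record is header, sequence, separator, qualities.
                record = [fq.readline() for _ in range(4)]
                if not record[0]:
                    break  # end of file
                length = len(record[1].rstrip())
                if length not in handles:
                    out_path = out_dir / f"reads_len{length}.fastq"
                    handles[length] = open(out_path, "w")
                handles[length].writelines(record)
                counts[length] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return dict(counts)
```

The returned counts also give the shortest read length in the sample, which is the number to compare against the k used to build the index.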

RNA-Seq rna-seq sequencing salmon • 1.3k views

I would quantify every run separately and then run a couple of diagnostics (PCA, correlations) to ensure that there are no confounding effects due to sequencing machine/center. In my experience k-mer length is not much of a factor; the mapping % will only change slightly (see e.g. Salmon Quantification for RNA-seq Read Pairs with Different Lengths), provided that read length >= k-mer length. I would use the same k-mer length for all files.
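A minimal sketch of the correlation diagnostic, assuming each run was quantified into its own output directory and that the standard Salmon `quant.sf` columns (`Name`, `TPM`) are present. The function names are hypothetical, and this covers only the pairwise correlation, not the PCA; it uses the standard library alone, so for many runs a data-frame library would likely be more convenient.

```python
import csv
import math


def read_tpm(quant_path):
    """Return {transcript_name: TPM} from a Salmon quant.sf file."""
    with open(quant_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return {row["Name"]: float(row["TPM"]) for row in reader}


def log_tpm_correlation(quant_a, quant_b):
    """Pearson correlation of log2(TPM + 1) over transcripts in both runs.

    Raises ZeroDivisionError if either run has constant TPM values.
    """
    a, b = read_tpm(quant_a), read_tpm(quant_b)
    names = sorted(set(a) & set(b))
    x = [math.log2(a[n] + 1) for n in names]
    y = [math.log2(b[n] + 1) for n in names]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)
```

For example, `log_tpm_correlation("run1/quant.sf", "run2/quant.sf")` (hypothetical paths) compares two runs of the same sample. If that value is clearly lower than the correlation between samples sequenced in the same run, a run or read-length effect would be worth investigating before merging.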


Thank you, ATpoint. Your comments are helpful, as is the link to the very informative discussion on read pairs with different lengths. I regret not finding it when I initially started searching for questions related to mine. Thank you for the link.
