Question

Salmon and reads order in fastq in quasi-mapping-based mode

0

6.4 years ago

ZheFrench ▴ 570

A really naïve question, but I've had this one in mind for a while.

From the Salmon docs: "If your reads or alignments do not appear in a random order with respect to the target transcripts, please randomize / shuffle them before performing quantification with Salmon."

I don't understand whether they are talking about BAM or fastq. I understand that a BAM can be sorted in different ways, but a fastq...?

I'm using Salmon in quasi-mapping-based mode on the fastqs. I was wondering whether they can be ordered in a way that requires shuffling before use with Salmon?

I mean, when you download paired-end fastq files using the SRA split-3 option, you get several R1 and R2 files that are ordered (how, by the way?). Same question when you receive data directly from your sequencing platform: you take your separate R1 and R2 files and run Salmon on them.

Do you need to shuffle the reads in the fastq files before launching Salmon?

Salmon • 2.6k views

ADD COMMENT • link updated 6.4 years ago by ATpoint 81k • written 6.4 years ago by ZheFrench ▴ 570

0

Unless you have a co-ordinate sorted BAM alignment file that you converted to fastq (or you had used a program like clumpify from BBMap, which re-orders raw reads, when it does de-duplication based on sequence alone), you should not have your reads in any kind of order.
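
A quick way to see whether a BAM declares itself coordinate-sorted is the `SO` tag on the `@HD` header line (the header can be printed with `samtools view -H`). Below is a minimal sketch that parses header text for that tag; the function name `sam_sort_order` is made up for illustration, and the tag is only what the file declares, so a file without it may still be sorted.

```python
def sam_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header.

    Falls back to "unknown" when there is no @HD line or no SO tag,
    which mirrors the default value in the SAM specification.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            break  # only one @HD line is allowed, so stop looking
    return "unknown"


if __name__ == "__main__":
    example = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
    print(sam_sort_order(example))
```

If this reports `coordinate`, a fastq produced from that BAM will inherit the coordinate order.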

ADD REPLY • link 6.4 years ago by GenoMax 141k

score 1 · Answer 1 · 2017-12-02

1

6.4 years ago

ATpoint 81k

If the fastqs come straight from the sequencer, you are fine. The issue is that you can convert a BAM back to fastq for realignment or requantification, and since BAMs are often coordinate-sorted, the resulting fastq would not be randomly ordered. Hence the recommendation to shuffle the fastq prior to quantification. Btw, the same holds true for any alignment. E.g. with BWA mem, the fastq is expected to be randomly ordered because BWA estimates the insert size distribution in paired-end mode from the chunk of reads that is currently being processed. With a coordinate-sorted fastq (from a BAM) you'd get chunks from repetitive or low-complexity regions, which would skew the insert size estimation for that region and could lead to false mapping results. Random fastq order compensates for this, as the probability of getting a chunk in which all reads originate from the same genomic region is quite low.
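
For paired-end data, the shuffle has to apply the same permutation to R1 and R2 so that mates stay on matching lines. A minimal sketch of that idea, assuming uncompressed 4-line fastq records that fit in memory (the function names and the `seed` parameter are made up for illustration, this is not a Salmon utility):

```python
import random
from itertools import islice


def read_fastq(path):
    """Read an uncompressed fastq file into a list of 4-line records."""
    records = []
    with open(path) as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                break
            if len(record) != 4:
                raise ValueError(f"truncated fastq record in {path}")
            # Normalise line endings so the last record can move safely.
            records.append([line.rstrip("\r\n") + "\n" for line in record])
    return records


def shuffle_paired_fastq(r1_in, r2_in, r1_out, r2_out, seed=0):
    """Write R1/R2 in a random order, using one permutation for both files."""
    r1 = read_fastq(r1_in)
    r2 = read_fastq(r2_in)
    if len(r1) != len(r2):
        raise ValueError("R1 and R2 contain different numbers of reads")
    order = list(range(len(r1)))
    random.Random(seed).shuffle(order)  # seeded, so the result is reproducible
    with open(r1_out, "w") as out1, open(r2_out, "w") as out2:
        for i in order:
            out1.writelines(r1[i])
            out2.writelines(r2[i])
    return order
```

For real datasets that do not fit in memory, a dedicated tool is the more practical choice; the sketch only shows why both files must be permuted together.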

ADD COMMENT • link 6.4 years ago by ATpoint 81k

0

This makes sense, but what about when putting BAM files directly into salmon?

I've taken fastq files from the sequencer, aligned them with STAR and then sorted with samtools, and then put the sorted BAMs into salmon. The text from the Salmon doc indicates that this could be a problem, but I'm not sure if it's only relevant to processing fastq files directly...

ADD REPLY • link 5.2 years ago by MaxF ▴ 120

0

The text from the Salmon doc indicates that this could be a problem

Which text? Did you align against a transcriptome rather than genome? See

ADD REPLY • link 5.2 years ago by ATpoint 81k

0

I used STAR in transcriptome mode (--quantMode TranscriptomeSAM), but I did align it against the hg38 genome.

The text I was referring to is what ATpoint quoted: "If your reads or alignments do not appear in a random order with respect to the target transcripts, please randomize / shuffle them before performing quantification with Salmon."