I am attempting to use Salmon (version 0.12.0) to quantify transcript counts from RNAseq data (unstranded paired-end reads).
The existing pipeline in my lab is to trim/qc fastq files, align them with STAR (2.5.2a), merge sample BAMs together (if they were spread across lanes), sort them with samtools, and then feed them to Salmon in alignment-based mode. STAR was run with the --quantMode TranscriptomeSAM
option and the genomeDir
pointed to a genome generated using STAR's genomeGenerate
function with the --sjdbGTFfile
option pointing to a GTF file.
Here's the Salmon run line:
salmon quant -t /hg38_salmon_transcriptome.fa -l IU -p 16 -a some.transcriptome.sorted.bam -o ./
This produces a quant.sf file that seems to make sense, although it also produces a massive (15+ GB) error file that seems really upset about suspicious pairs (see here for someone else with the same issue: Salmon warning detected suspicious pair )
WARNING: Detected suspicious pair ---
The names are different:
read1 : K00274:68:HGYCHBBXX:5:1119:13311:37220
read2 : K00274:68:HGYCHBBXX:3:2103:3325:33598
I was a bit sick of this behavior and decided to give Salmon the fastq files directly in alignment-free mode. This pipeline was to trim/qc my fastqs, combine the files from different lanes and then gives the reads to salmon. The program runs fine and produces no errors.
Here's that Salmon run line:
salmon quant -i /hg38_salmon_transcriptome_index -l IU -1 samp_R1.fq.gz -2 samp_R2.fq.gz -o ./
The big issue is that the quant.sf files look really different between these two pipelines. Like the transcript ENST00000361739 (this is the MT-CO1 gene) has a TPM of 40735 in alignment-free mode, but a TPM of 105 in alignment-mode. Further, the range of TPMs (and counts) varies widely between the two modes. In alignment-free mode, I'm getting TPMs in the thousands, but in alignment mode the highest is 300 and most values are under 100.
So, my questions are:
- Does anyone know what's causes the "suspicious pair" error when I'm running in alignment-based mode?
- Why am I getting such huge differences in the outputs of these two strategies?