Question

Big differences between mappings computed by Salmon and quantification

2

Entering edit mode

5.8 years ago

KDA ▴ 20

Hello !

I have been using Salmon to quantify RNA-seq of tumoral tissues using a quasi index of 161 transcripts, part of them being endogenous retroviruses sequences. Here is the command info of the quantification of a sample :

{
"salmon_version": "0.10.2",
"index": "indexes/DB7",
"libType": "A",
"mates1": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq",
"mates2": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq",
"threads": "8",
"numBootstraps": "100",
"seqBias": [],
"gcBias": [],
"writeMappings": "bam/3963279a-4960-49a2-936a-a13bb4a7dde5",
"output": "output/HSNC/salmon/3963279a-4960-49a2-936a-a13bb4a7dde5",
"auxDir": "aux_info"
}

I've also used the "--writeMappings" argument for some samples and with a combination of samtools view, sort, and index, created sorted BAM files in order to vizualize them in IGView (I also created with IGView a pseudo-genome based on my set of transcript). However, for some transcripts, there are big differences between the observed coverage and the actual quantification computed by Salmon. I've read the documentation about that and understand that the mappings are computed before the quantification, so they're bound to be different. However, for some transcripts, particularly a family of retroviral enveloppes (K family), there are huge gaps sometimes. For some transcripts, the quantification gives a NumReads of 0 while hundreads or thousands of reads can be observed in IGView.

The documentation says that the reads in the mapping can be incompatible with the library type inferred since its done before filtering, however, my lib_format_counts.json file shows a compatible fragment ratio of 100%, so I don't think it's the problem :

{
"read_files": "( output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq, output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq )",
"expected_format": "IU",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 2737879,
"num_assigned_fragments": 2737879,
"num_frags_with_consistent_mappings": 1714617,
"num_frags_with_inconsistent_or_orphan_mappings": 1023262,
"strand_mapping_bias": 0.4979922629951762,
"MSF": 0,
"OSF": 0,
"ISF": 860751,
"MSR": 0,
"OSR": 0,
"ISR": 853866,
"SF": 526731,
"SR": 496531,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}

I've run RSEM on these samples for comparison purposes and the quantification of RSEM roughly match up the one done by Salmon, so I don't think the quantification is at fault here. I thought that maybe since this affects mostly an entire family of retroviruses with often homologous sequences, during the mapping phase the reads from one sequence are distributed to all similar sequences, and then it's corrected after ?

My question is, how can I validate or invalidate this hypothesis ? Or is there another reason that could explain this incoherence between the mappings and the quantification ?

Thanks in advance !

RNA-Seq Salmon Integrated Genome Viewer • 5.0k views

ADD COMMENT • link updated 5.8 years ago by Rob 6.5k • written 5.8 years ago by KDA ▴ 20

1

Entering edit mode

A quick and dirty explanation: if one transcript get a lot of reads but all of them multi-mappers, and a second location get the same lots of multi-mappers (they share all the multi-mappers) but also get a lot of uniquely mapped reads, the EM-algorithm used by Salmon will assign all the counts to the second transcript.

Also tagging Rob for a more elaborate and definitive answer.

ADD REPLY • link 5.8 years ago by h.mon 35k

score 3 · Accepted Answer · 2018-07-12

Hi KDA,

Actually, there isn't much to add to h.mon's comment --- so I'd suggest that maybe he move it from a comment to an answer. Basically, having transcripts with many mapping reads but large gaps in coverage be assigned abundance close to 0, especially when there are other transcripts _without_ gaps in coverage explaining the same reads, is exactly what you would expect. Because of shared sequence, many reads will multi-map to transcripts that turn out to be poor explanations for these reads in light of other evidence. This is exactly what the statistical inference procedure in salmon (and RSEM etc.) is meant to resolve. You could try and track down some of the specific cases. For example, pick a read that maps to one of these transcripts (the ones with many multi-mapping reads, but with gaps in coverage), and then look at the other transcripts where it maps. Find one that has high abundance. Does that transcript look to have a more well-behaved and expected coverage profile? Another quick thing you could try is to find these transcripts that get assigned 0 or near-0 abundance, and remove them from the reference. If you re-quantify without them, do you explain approximately the same number of reads? If so, that means that the same number of sequenced fragments could be explained without these transcripts, and so the assignment of a near-0 abundance to them may be reasonable.