Question: Big differences between mappings computed by Salmon and quantification
2
gravatar for KDA
10 weeks ago by
KDA20
Paris
KDA20 wrote:

Hello !

I have been using Salmon to quantify RNA-seq of tumoral tissues using a quasi index of 161 transcripts, part of them being endogenous retroviruses sequences. Here is the command info of the quantification of a sample :

{
"salmon_version": "0.10.2",
"index": "indexes/DB7",
"libType": "A",
"mates1": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq",
"mates2": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq",
"threads": "8",
"numBootstraps": "100",
"seqBias": [],
"gcBias": [],
"writeMappings": "bam/3963279a-4960-49a2-936a-a13bb4a7dde5",
"output": "output/HSNC/salmon/3963279a-4960-49a2-936a-a13bb4a7dde5",
"auxDir": "aux_info"
}

I've also used the "--writeMappings" argument for some samples and with a combination of samtools view, sort, and index, created sorted BAM files in order to vizualize them in IGView (I also created with IGView a pseudo-genome based on my set of transcript). However, for some transcripts, there are big differences between the observed coverage and the actual quantification computed by Salmon. I've read the documentation about that and understand that the mappings are computed before the quantification, so they're bound to be different. However, for some transcripts, particularly a family of retroviral enveloppes (K family), there are huge gaps sometimes. For some transcripts, the quantification gives a NumReads of 0 while hundreads or thousands of reads can be observed in IGView.

The documentation says that the reads in the mapping can be incompatible with the library type inferred since its done before filtering, however, my lib_format_counts.json file shows a compatible fragment ratio of 100%, so I don't think it's the problem :

{
"read_files": "( output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq, output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq )",
"expected_format": "IU",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 2737879,
"num_assigned_fragments": 2737879,
"num_frags_with_consistent_mappings": 1714617,
"num_frags_with_inconsistent_or_orphan_mappings": 1023262,
"strand_mapping_bias": 0.4979922629951762,
"MSF": 0,
"OSF": 0,
"ISF": 860751,
"MSR": 0,
"OSR": 0,
"ISR": 853866,
"SF": 526731,
"SR": 496531,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}

I've run RSEM on these samples for comparison purposes and the quantification of RSEM roughly match up the one done by Salmon, so I don't think the quantification is at fault here. I thought that maybe since this affects mostly an entire family of retroviruses with often homologous sequences, during the mapping phase the reads from one sequence are distributed to all similar sequences, and then it's corrected after ?

My question is, how can I validate or invalidate this hypothesis ? Or is there another reason that could explain this incoherence between the mappings and the quantification ?

Thanks in advance !

ADD COMMENTlink modified 10 weeks ago by Rob2.6k • written 10 weeks ago by KDA20
1

A quick and dirty explanation: if one transcript get a lot of reads but all of them multi-mappers, and a second location get the same lots of multi-mappers (they share all the multi-mappers) but also get a lot of uniquely mapped reads, the EM-algorithm used by Salmon will assign all the counts to the second transcript.

Also tagging Rob for a more elaborate and definitive answer.

ADD REPLYlink written 10 weeks ago by h.mon19k
1
gravatar for Rob
10 weeks ago by
Rob2.6k
United States
Rob2.6k wrote:

Hi KDA,

Actually, there isn't much to add to h.mon's comment --- so I'd suggest that maybe he move it from a comment to an answer. Basically, having transcripts with many mapping reads but large gaps in coverage be assigned abundance close to 0, especially when there are other transcripts _without_ gaps in coverage explaining the same reads, is exactly what you would expect. Because of shared sequence, many reads will multi-map to transcripts that turn out to be poor explanations for these reads in light of other evidence. This is exactly what the statistical inference procedure in salmon (and RSEM etc.) is meant to resolve. You could try and track down some of the specific cases. For example, pick a read that maps to one of these transcripts (the ones with many multi-mapping reads, but with gaps in coverage), and then look at the other transcripts where it maps. Find one that has high abundance. Does that transcript look to have a more well-behaved and expected coverage profile? Another quick thing you could try is to find these transcripts that get assigned 0 or near-0 abundance, and remove them from the reference. If you re-quantify without them, do you explain approximately the same number of reads? If so, that means that the same number of sequenced fragments could be explained without these transcripts, and so the assignment of a near-0 abundance to them may be reasonable.

ADD COMMENTlink written 10 weeks ago by Rob2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 643 users visited in the last hour