Salmon ~ Effective Length
16 days ago
chrisk • 0

Hello Biostars community,

I have a question about how the fragment length distribution affects Salmon's EffectiveLength computation (and therefore TPM), given our situation below.

  • 150 bp sequencing
  • paired end library
  • 90 percent of reads on genes are overlapping
  • negative inner distance reported by RSeQC (-150bp to -100bp)
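As a rough sanity check on these numbers, the reported inner distance can be converted to an implied fragment length. This is a minimal sketch that assumes the usual definition (inner distance = fragment length minus the two read lengths) and untrimmed 150 bp reads; the function name is hypothetical.

```python
def implied_fragment_length(inner_distance, read_length=150):
    """Fragment length implied by an inner distance.

    Assumes inner_distance = fragment_length - 2 * read_length,
    i.e. both mates are untrimmed and of equal length.
    """
    return 2 * read_length + inner_distance


# The reported range of inner distances for this library.
for inner in (-150, -100):
    print(inner, implied_fragment_length(inner))
```

Under these assumptions, inner distances of -150 bp to -100 bp correspond to fragments of roughly 150 bp to 200 bp, so the two mates cover largely the same bases.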

In most sequencing libraries, fragmentation and sequencing are set up to avoid overlapping paired-end reads, whereas ours clearly overlap.

How do these two scenarios (overlapping versus non-overlapping reads) affect the fragment length distribution calculation, assuming it's impossible to calculate insert size when paired-end reads do not overlap in an RNA library?

Thanks in advance, Chris

Salmon • 365 views

assuming its impossible to calculate insert size when paired end reads do not overlap in an RNA library

If you have a reference available, paired-end sequencing lets you estimate the length of the sequenced library fragment by inferring how far apart the two reads map/align on the reference. One exception is a library fragment that captured a breakpoint (e.g. the two ends map to two different chromosomes). In that case it is not possible to estimate insert size.
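To illustrate, here is a minimal sketch of that inference for a single read pair. It assumes both mates map to the same reference sequence and that coordinates are 0-based, half-open intervals; the function name and the coordinates are made up for the example.

```python
def fragment_length(start1, end1, start2, end2):
    """Implied fragment length for two mates on the same reference.

    Coordinates are 0-based, half-open [start, end) intervals.
    The result is the span from the leftmost to the rightmost mapped base,
    whether or not the two mates overlap.
    """
    return max(end1, end2) - min(start1, start2)


# Mates that do not overlap (gap of 50 bases between them).
non_overlapping = fragment_length(100, 250, 300, 450)

# Mates that overlap by 100 bases.
overlapping = fragment_length(100, 250, 150, 300)

print(non_overlapping, overlapping)
```

The same calculation applies in both cases, which is why overlap by itself does not prevent insert size estimation.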

90 percent of reads on genes are overlapping

You have a "short" insert library. There is no solution for this specific issue except making a new prep/library.

16 days ago
Rob 6.7k

Salmon will compute the fragment length distribution based on the (probabilistically weighted) implied distance between the fragment ends given the mapping information. That is, in general, the statement "its impossible to calculate insert size when paired end reads do not overlap in an RNA library" isn't quite true.

Consider a fragment that has a unique mapping in the transcriptome for both ends. In this case, even if the read ends do not overlap, we can use the alignment to infer the length of the original fragment prior to sequencing. It is the difference between the leftmost and rightmost mapped positions of the read ends on the reference sequence where they both map.

Now, obviously, in the presence of read-mapping uncertainty (i.e. multimapping reads) this becomes more difficult. However, Salmon overcomes this by using the estimated transcript abundances as computed during the online phase of its inference algorithm to probabilistically update its estimate of the fragment length histogram based on all of the observed mappings of a fragment. Even without this complexity, the fragment length distribution is often relatively smooth and "well-behaved" and can thus be robustly estimated from a relatively small number of samples.

Anyway, regardless of whether or not the fragment ends overlap, Salmon will make use of the mappings of the fragment ends to the indexed transcriptome to infer the implied length of the underlying fragments, and will use these values (appropriately weighted) to estimate the fragment length distribution.
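The weighting idea can be sketched roughly as follows. This is a simplified illustration, not Salmon's actual implementation: each fragment contributes one unit of mass, split across its candidate mappings in proportion to the current abundance estimate of each transcript. The function name, transcript IDs, and numbers are all hypothetical.

```python
from collections import defaultdict


def weighted_fld(fragments, abundance):
    """Abundance-weighted fragment length histogram.

    fragments: list of fragments, each a list of candidate mappings
               (transcript_id, implied_fragment_length).
    abundance: dict mapping transcript_id to a current abundance estimate.
    Returns a dict mapping fragment length to probability mass.
    """
    hist = defaultdict(float)
    for candidates in fragments:
        total = sum(abundance[t] for t, _ in candidates)
        if total == 0:
            continue  # no informative mapping for this fragment
        for t, length in candidates:
            hist[length] += abundance[t] / total
    mass = sum(hist.values())
    if mass == 0:
        return {}
    return {length: w / mass for length, w in hist.items()}


fragments = [
    [("tx1", 180)],                 # uniquely mapped fragment
    [("tx1", 200), ("tx2", 160)],   # multimapping fragment
]
abundance = {"tx1": 0.75, "tx2": 0.25}
fld = weighted_fld(fragments, abundance)
print(fld)
```

In this toy case the uniquely mapped fragment contributes all of its mass to length 180, while the multimapping fragment splits its mass 3:1 between lengths 200 and 160.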


Hi @Genomax and @Rob,

Thanks for the note and my apologies for the delay.

@Genomax, couldn't agree more that the short insert library (and negative inner distance) is a central concern. We don't anticipate this going away in future wet-lab work, so we are erring on the side of caution with regard to data assumptions.

@Rob, thanks very much for the detail. It is reassuring to know that, thanks to the transcriptome alignments and paired-end reads, this is a win-win situation. We had a further think about your points, along with extra reading from the COMBINE-lab GitHub. Specifically, we found your information about the values --fldMean and --fldSD being used as prior parameters of a normal distribution, which is then truncated on the left at 0, very helpful (issue 127, https://github.com/COMBINE-lab/salmon/issues/127).

We took a look at our fastq files, pre and post trimming. On average, the median is lower than the mean (albeit by approx. 10) and a histogram shows a positive (right) skew (e.g. 1.032, when using "skewness" in the R "moments" package). Are heavy tails detected and adjusted for by Salmon, causing an update to the prior distribution and possibly transforming the Gaussian probabilistic model? I hope that made some sense.

Would the positive skew have any impact on TPM output, and would it be preferable not to use TPM for further downstream analysis?

Thanks in advance, Chris

Example Fragment Distribution


Note that TPMs from salmon are transcript level, so in any case they would typically not be used downstream, unless you want to analyze transcripts rather than genes. I would simply aggregate the counts to gene level with tximport and go along with that. Normalize with DESeq2 or edgeR, or some transformation such as vst. Use that downstream.


Hi chrisk

The non-normality of the final distribution shouldn't be a problem. Salmon only uses a truncated normal as the prior; the final density is an empirical distribution, so it will conform to the observed fragment lengths. In fact, the MultiQC tool can read the salmon output formats and will draw a nice histogram of the fragment length distribution as inferred by salmon.
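As a toy illustration of why the prior stops mattering once enough fragments are observed, the sketch below combines a discretized, left-truncated normal prior with observed counts. It is not salmon's actual update rule: the pseudo-count weight, the prior parameters, the maximum length, and the right-skewed toy data are all assumptions made for the example.

```python
import math


def truncated_normal_prior(mean, sd, max_len):
    """Discretized normal density on lengths 1..max_len, renormalized so
    that mass outside this range is discarded (left truncation)."""
    dens = [math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in range(1, max_len + 1)]
    total = sum(dens)
    return [d / total for d in dens]


def empirical_fld(observed_lengths, prior, prior_weight=10.0):
    """Add observed counts to prior pseudo-counts and normalize.

    prior_weight is a hypothetical number of pseudo-observations.
    """
    counts = [p * prior_weight for p in prior]
    for length in observed_lengths:
        if 1 <= length <= len(counts):
            counts[length - 1] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def mean_length(fld):
    """Mean of a distribution stored as a list indexed by length - 1."""
    return sum((i + 1) * p for i, p in enumerate(fld))


prior = truncated_normal_prior(mean=250, sd=25, max_len=1000)
# Right-skewed toy data: most fragments are short, with a tail of longer ones.
observed = [150] * 600 + [170] * 250 + [200] * 100 + [260] * 50
fld = empirical_fld(observed, prior)
print(round(mean_length(prior), 1), round(mean_length(fld), 1))
```

With 1000 observations against 10 pseudo-counts, the resulting distribution follows the skewed data almost entirely, and its mean sits close to the observed mean rather than the prior mean.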

To the second point about downstream analysis: in general, one wouldn't use the TPM for much quantitative analysis anyway, as TPM is a "within-sample" normalization technique and questions like differential expression will have to account for differences in library size. The tximport package, however, will take care of all of this for you, accounting for the effective transcript lengths (and even the fact that the effective length for the same transcript can change between samples based on the fragment length distribution) and the library sizes of the different samples.
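To make the effective length point concrete, here is a minimal sketch. It assumes the commonly cited definition (effective length = transcript length minus the mean of the fragment length distribution truncated to fragments no longer than the transcript) and ignores any bias corrections. The function names, distributions, and counts are hypothetical.

```python
def truncated_mean(fld, max_len):
    """Mean of the fragment length distribution restricted to lengths <= max_len.

    fld: dict mapping fragment length to probability.
    """
    kept = {x: p for x, p in fld.items() if x <= max_len}
    mass = sum(kept.values())
    if mass == 0:
        raise ValueError("no fragment length fits within the transcript")
    return sum(x * p for x, p in kept.items()) / mass


def effective_length(transcript_length, fld):
    """Transcript length minus the truncated mean fragment length."""
    return transcript_length - truncated_mean(fld, transcript_length)


def tpm(counts, eff_lengths):
    """Within-sample TPM from estimated counts and effective lengths."""
    rates = [c / e for c, e in zip(counts, eff_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# The same 1000 bp transcript in two samples with different fragment lengths.
fld_short = {150: 0.5, 200: 0.5}   # mean 175
fld_long = {300: 0.5, 400: 0.5}    # mean 350
eff_short = effective_length(1000, fld_short)
eff_long = effective_length(1000, fld_long)
print(eff_short, eff_long)

# TPM for two transcripts with equal counts but different effective lengths.
values = tpm([100, 100], [eff_short, eff_long])
print(values)
```

The transcript has a larger effective length in the short-fragment sample, which is the between-sample difference that has to be accounted for when comparing abundances across libraries.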
