Transcript level counts from full length scRNA data
1
1
Entering edit mode
16 months ago

Hi All,

I am wondering if anyone has estimated transcript-level counts from plate-based scRNA-seq datasets like Smart-seq, which capture full-length transcripts instead of 3' ends. I tried salmon, but the estimated counts/TPMs seem to be highly inflated.

For example, for this transcript, the output from a single cell looks like this.

Name    Length  EffectiveLength TPM     NumReads
ENST00000344843.7       721     250     247.016 471.945


When I looked at the same cell's BAM file in IGV, that transcript has only around 50 reads mapped, yet salmon's NumReads is 471. This happens for a lot of transcripts.

I am wondering why the values are inflated? One potential reason could be the default scaling factor, which is designed for bulk RNA-seq where total counts tend to be in the millions.

I would like to know if anyone has estimated transcript counts from scRNA-seq before. There are tools like alevin, but they seem to work only with 3'-enriched droplet-based methods.

Thanks,

Goutham A

scRNA RNA-Seq salmon star • 772 views
1
Entering edit mode

When I looked the same cell bam file in IGV, that transcript has only around 50 reads mapped

Your coverage scale goes up to 51, so the depth at any single position is at most 51. Coverage and read count are not the same thing: if you have 2 non-overlapping reads, your maximum coverage will be 1, but you actually have 2 counts. Even if you look at just the first exon, you'll see that there are many non-overlapping reads, so you should have more than 50 counts for that exon alone.
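To make the point above concrete, here is a toy sketch (positions and reads are made up, not from the OP's data) showing how peak per-base coverage can sit far below the total read count when reads tile a transcript without overlapping:

```python
# Each read is a (start, end) half-open interval on a transcript.
# Four reads, none of which overlap each other.
reads = [(0, 50), (60, 110), (120, 170), (180, 230)]

def max_coverage(reads, length):
    """Peak per-base depth, the number IGV's coverage track scale reflects."""
    cov = [0] * length
    for start, end in reads:
        for pos in range(start, end):
            cov[pos] += 1
    return max(cov)

total_reads = len(reads)            # what a counter like salmon reports: 4
peak_depth = max_coverage(reads, 250)  # what the IGV scale shows: 1
print(total_reads, peak_depth)
```

So an IGV coverage track topping out around 50 is entirely consistent with several hundred reads assigned to the transcript.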

0
Entering edit mode

Thanks. I missed that logic of independent read counts.

3
Entering edit mode
16 months ago
Rob 4.9k

Since this is full-length data without UMIs, salmon would be the tool to use here. Out of curiosity, how did you obtain the pile-up? How do you know those are all reads coming from this transcript? Is it a single-transcript gene? You can always align the reads with STAR and get a BAM file using --quantMode TranscriptomeSAM if you want quantification from those alignments.
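For reference, the alignment-based route suggested above looks roughly like this (index and file names are placeholders, adjust to your setup):

```shell
# 1) Align with STAR, also projecting alignments onto the transcriptome
STAR --runThreadN 8 \
     --genomeDir star_index \
     --readFilesIn cell_R1.fastq.gz cell_R2.fastq.gz \
     --readFilesCommand zcat \
     --quantMode TranscriptomeSAM \
     --outSAMtype BAM Unsorted

# 2) Quantify from those alignments with salmon's alignment-based mode
salmon quant -t transcripts.fa -l A \
       -a Aligned.toTranscriptome.out.bam \
       -o salmon_quant
```

STAR writes the transcriptome-projected alignments to Aligned.toTranscriptome.out.bam, which salmon can consume directly via -a instead of re-mapping the FASTQ files.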

0
Entering edit mode

It's aligned with STAR and loaded into IGV, but salmon was run on the raw FASTQ files with a transcriptome index. So I don't need to use any special scaling factors for scRNA-seq data, even though the total library sizes are much smaller than in bulk RNA-seq? Usually scRNA-seq tools use a 10,000 scaling factor instead of 1 million, so I was just concerned. --quantMode TranscriptomeSAM seems to be a good idea. Thanks for the quick response.

0
Entering edit mode

How did you import the salmon counts into Seurat? With alevin, it's recommended that the counts be imported without length correction, but since this is full-length data I assume we would want some type of length correction, as is done with tximport. Perhaps still using txi$counts but with countsFromAbundance="lengthScaledTPM"? This would likely change your downstream findings. @Rob, perhaps you can confirm whether my suggestion above is correct? I don't see much discussion of this online (the default txi$counts from salmon via tximport is not length-corrected), but given that skipping length correction is only recommended for 3' protocols, I think this info could be helpful to the OP.

0
Entering edit mode

Wouldn't TPMs (txi$abundance) be better for full-length protocols?

0
Entering edit mode

I don't think so... (could be wrong, open to feedback). I believe providing the counts is the route to take, mainly because: 1) counts returned with countsFromAbundance="lengthScaledTPM" are corrected for length bias in addition to depth/library size; 2) in Seurat, there is an option to create a Seurat object from either raw counts or TPMs (nothing specific to abundances, but I get why you would say that); 3) this paper by Soneson and Robinson seems to have used length-scaled TPMs throughout their study, e.g.:

First:

the rounded length-scaled TPMs for all genes with at least two nonzero counts are used as input to the simulator, and a data set with the same number of genes is generated

and later also mentioned:

Seurat is applied using either the default ‘bimod’ likelihood ratio test (applied to the length-scaled TPMs, which are log-normalized internally) ...

There are other mentions of this in the manuscript. So that's my conclusion. Would love to hear more feedback on this since there isn't really a lot of discussion on using tximport with full-length seq protocols (usually alevin info gets pulled up)

0
Entering edit mode

I think the name of that parameter is somewhat confusing (although I don't know of a better option that is still a reasonable number of characters). The DTU paper has a summary:

The tximport package offers three methods for producing count matrices from transcript-level quantification files, as described in detail in Soneson et al., and already mentioned in the introduction. To recap these are: (1) original estimated counts with effective transcript length information used as a statistical offset, (2) generating "counts from abundance" by scaling TPM abundance estimates per sample such that they sum to the total number of mapped reads (scaledTPM), or (3) generating "counts from abundance" by scaling TPM abundance estimates first by the average effective transcript length over samples, and then per sample such that they sum to the total number of mapped reads (lengthScaledTPM).
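The arithmetic behind the two "counts from abundance" modes described above can be sketched with toy numbers (this follows the quoted description, not tximport's actual R code; all values here are invented for illustration):

```python
import numpy as np

# Toy data: 2 transcripts x 2 samples (cells)
tpm = np.array([[100.0, 200.0],
                [400.0, 300.0]])
eff_len = np.array([[250.0, 240.0],     # effective lengths per sample
                    [500.0, 520.0]])
lib_size = np.array([1000.0, 2000.0])   # mapped reads per sample

# scaledTPM: rescale each sample's TPMs so they sum to its library size
scaled_tpm = tpm / tpm.sum(axis=0) * lib_size

# lengthScaledTPM: first multiply each transcript's TPM by its average
# effective length across samples, then rescale columns to library size
mean_len = eff_len.mean(axis=1, keepdims=True)
weighted = tpm * mean_len
length_scaled_tpm = weighted / weighted.sum(axis=0) * lib_size

print(scaled_tpm)
print(length_scaled_tpm)
```

Both matrices have columns summing to the per-sample mapped-read totals; only lengthScaledTPM additionally folds transcript length back in, which is why it is the variant discussed above for full-length protocols.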

I agree it would be great to get some clarification, especially with single-cell where it's difficult to really make any transcript-level determinations.

1
Entering edit mode

I like to consolidate Bioc-related questions on the other support site. Anyone should feel free to post a tximport question there.

0
Entering edit mode

Thanks Michael. I will go ahead and post a topic there to continue this conversation. I think at this point it's clear the community could benefit from further discussions on this.

2
Entering edit mode

^ added a question for follow-up in the Bioconductor forum for those who are interested: https://support.bioconductor.org/p/132550/