Question

Interpretation of reads

2.8 years ago

Alicia ▴ 20

Hello! I am trying to understand a terminology detail about the count matrix used in a single-cell RNA-seq experiment. I know each entry represents the number of reads mapped to a particular gene in a particular cell. However, I am unsure about the exact meaning of a "read".

If I understood correctly, at the beginning of a scRNA-seq experiment, you have to break the transcripts into small pieces (because the sequencer cannot sequence fragments that are too long). How do we call those small pieces that we have before PCR amplification? Read? Fragment? Both? Then, we have to convert those RNA pieces into DNA and amplify them with PCR: we obtain a lot of copies that, I believe, are called "amplicons". Then we sequence all those amplicons. At that point, I have another doubt: are all those pieces (including all the duplicates) then written in the FASTQ file? How could we know which amplicons come from the same original piece of RNA?

Once we have the FASTQ files, we can align them to our genome to obtain a BAM file. At this point we create the count matrix by counting how many lines in the BAM file correspond to an exon of each gene.

So, I would like to know if an entry in a usual count matrix represents:

The number of original "pieces" of all the transcripts matching the region of the gene before amplification? (if this is the case, how can we retrieve this number after amplification?)
The number of amplicons matching the region of the gene (therefore including all the duplicates)? (if this is the case, we assume that all the pieces were equally amplified so that those counts remain comparable?)

Many thanks!

interpretation scRNA-seq RNA-seq count-matrix amplification • 1.2k views

updated 2.8 years ago by dsull ★ 5.8k • written 2.8 years ago by Alicia ▴ 20

score 1 · Answer 1 · 2021-06-16

Yes, those small pieces are "fragments" (not reads; reads are what you see in the final FASTQ file).

"are all those pieces (including all the duplicates) then written in the FASTQ file?" -- yes.

"How could we know which amplicons come from the same original piece of RNA?" -- this is what UMIs (unique molecular identifiers) are for.

Yes, the count matrix should represent "original pieces". Again, this is what UMIs are for.
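To make the UMI idea concrete, here is a minimal sketch in Python. It assumes each aligned read has already been reduced to a (cell barcode, UMI, gene) triple; the barcodes, UMIs, gene names, and the helper functions `count_reads` and `count_umis` are all made up for illustration. It compares counting every read (duplicates included) with counting distinct UMIs, which approximates the number of original molecules.

```python
from collections import defaultdict

# Hypothetical aligned reads, one tuple per read: (cell_barcode, umi, gene).
# Reads sharing the same cell, UMI, and gene are treated as PCR copies
# of a single original molecule.
reads = [
    ("CELL_A", "AACGT", "GeneX"),
    ("CELL_A", "AACGT", "GeneX"),  # PCR duplicate
    ("CELL_A", "AACGT", "GeneX"),  # PCR duplicate
    ("CELL_A", "TTGCA", "GeneX"),  # a second molecule of GeneX
    ("CELL_A", "GGATC", "GeneY"),
    ("CELL_B", "AACGT", "GeneX"),
    ("CELL_B", "AACGT", "GeneX"),  # PCR duplicate
]


def count_reads(reads):
    """Count every read per (cell, gene), duplicates included."""
    counts = defaultdict(int)
    for cell, _umi, gene in reads:
        counts[(cell, gene)] += 1
    return dict(counts)


def count_umis(reads):
    """Count distinct UMIs per (cell, gene), collapsing PCR duplicates."""
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in umis.items()}


read_counts = count_reads(reads)
umi_counts = count_umis(reads)

for key in sorted(read_counts):
    print(key, "reads:", read_counts[key], "UMIs:", umi_counts[key])
```

In this toy data, GeneX in CELL_A has 4 reads but only 2 distinct UMIs, so the count matrix entry would be 2. Real pipelines typically also correct for sequencing errors within the UMI itself, which this sketch ignores.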

score 0 · Answer 2 · 2021-06-16

"How do we call those small pieces that we have before PCR amplification? Read? Fragment? Both?"

It's called a fragment.

"Then, we have to convert those RNA pieces into DNA and amplify them with PCR"

It depends on the sequencing technology. After the cDNA is generated, amplification may occur only to amplify the detection signal, as in cluster generation in the modern Illumina process. Older technology required PCR amplification before sequencing, and methods required deduplication (for WGS/WXS) or some normalization method in RNA-seq.

"The number of original "pieces" of all the transcripts matching the region of the gene before amplification?"

It should represent a proportional signal: low counts mean low expression, and higher counts mean higher expression levels for the gene/transcript.
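As a rough illustration of "proportional signal", here is a minimal Python sketch of counts-per-million scaling. This is one common library-size normalization, used here as an assumption and not necessarily the method meant above; the gene names, counts, and the helper `counts_per_million` are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw counts so that they sum to one million for the sample."""
    total = sum(counts.values())
    if total == 0:
        # Nothing was counted, so every gene gets zero.
        return {gene: 0.0 for gene in counts}
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Two hypothetical cells with the same relative expression,
# but the second one was sequenced ten times deeper.
cell_1 = {"GeneX": 10, "GeneY": 30, "GeneZ": 60}
cell_2 = {"GeneX": 100, "GeneY": 300, "GeneZ": 600}

cpm_1 = counts_per_million(cell_1)
cpm_2 = counts_per_million(cell_2)

print(cpm_1)
print(cpm_2)
```

Both cells end up with the same scaled values, which is the sense in which the counts are a relative measure of expression and not an absolute number of molecules.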

"The number of amplicons matching the region of the gene (therefore including all the duplicates)? (if this is the case, we assume that all the pieces were equally amplified so that those counts remain comparable?)"

There is some bias due to GC content or sequence complexity, but in general, with recent technologies, these biases are not caused by amplicon duplication.