Question: How much of the genome is covered with RNA-sequencing?
chen.leibson 10 wrote, 18 days ago:

When performing high-throughput RNA sequencing (with human samples), how many of the 3 billion base pairs of DNA will be covered after alignment and quality control?

In other words - how much of the genome can theoretically be reproduced from the RNA-seq?

I'm looking for even just a rough estimate, but if it helps, the samples that I'm interested in are from human brains, expressing ~16,000 genes, of which ~13,000 are protein coding.

I couldn't find an answer by googling, and would appreciate any help.

rna-seq alignment • 194 views
modified 17 days ago by Istvan Albert ♦♦ 84k • written 18 days ago by chen.leibson 10

It is going to depend on the quality of your libraries and the method used for making them. Since you are going to enrich/capture non-rRNA transcripts, what gets captured/sampled in your library is fixed. In theory, all such transcripts present in your sample have a chance of being captured in the library.

modified 18 days ago • written 18 days ago by genomax 89k

Thank you for your answer! This is actually not my data, I'm just using it to do some calculations. According to the article from which it is taken, they used Illumina Stranded Total RNA Prep with Ribo-Zero Plus for the library, on cortical samples. Does this help? Or, is there a way for this to be calculated maybe?

modified 18 days ago • written 18 days ago by chen.leibson 10
Istvan Albert ♦♦ 84k (University Park, USA) wrote, 17 days ago:

I have looked at a high-quality brain sample that I had worked on.

samtools depth -a data.bam | awk ' $3>0 { count += 1  } END { print (count/NR) }'

It appears that about 7.5% of the genome is covered (the command took about 20 minutes to compute that).

It is true that a larger fraction of the genome may be observed to be transcribed into RNA, but those transcripts would not be picked up by a typical RNA-Seq experiment.
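
A possible refinement of the one-liner above, as a minimal sketch only: it reports the covered fraction per chromosome as well as genome-wide, and lets you require a minimum depth. The function name `covered_by_chrom` and the default threshold are assumptions made for illustration; the only thing taken from the post is the three-column `chrom pos depth` output of `samtools depth -a`.

```shell
#!/bin/sh
# covered_by_chrom [MIN_DEPTH]
# Reads "chrom <tab> pos <tab> depth" lines (the format printed by
# `samtools depth -a`) on stdin and prints, per chromosome and for the
# whole input, the fraction of positions with depth >= MIN_DEPTH (default 1).
# Hypothetical helper for illustration, not part of samtools.
covered_by_chrom() {
    min_depth="${1:-1}"
    awk -v min="$min_depth" '
        { total[$1]++; all++ }                 # every reported position
        $3 >= min { hit[$1]++; allhit++ }      # positions meeting the threshold
        END {
            for (c in total) printf "%s\t%.4f\n", c, hit[c] / total[c]
            if (all > 0) printf "genome\t%.4f\n", allhit / all
        }' | sort                              # awk array order is unspecified
}

# Intended use (assumes samtools is installed and data.bam exists):
#   samtools depth -a data.bam | covered_by_chrom 1
```

Raising the threshold (for example `covered_by_chrom 5`) gives a stricter figure that is less sensitive to stray single reads.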

modified 17 days ago • written 17 days ago by Istvan Albert ♦♦ 84k

Can you provide some additional information about how many reads are aligned here? What is the length of the reads? How are the multi-mappers treated (allowed to multi-map or placed at one random location)?

I have seen coverage figures of 10-20% mentioned for single-cell RNA-seq, but this appears to be even less than that. Interesting.

written 17 days ago by genomax 89k

21 million alignments, 150 bp reads aligned with TopHat2 (the results are quite a few years old). I think it places multi-mapped reads randomly.

This dataset is my go-to for checking various RNA-Seq expectations, since it is amazingly consistent across all replicates. Even the fragmentation in the UTRs etc. is identical. (It is the first in the tracks below.)

[image of the tracks not preserved]

modified 17 days ago • written 17 days ago by Istvan Albert ♦♦ 84k

"It appears that the genome is covered at about 7.5% rate (command took 20 minutes to compute that)"

In general, only 5-15% of the genome is actually transcribed in a (mature) tissue. I don't remember the reference, but I recall that range from working with the Illumina Body Map 2.

written 17 days ago by JC 11k

If you remember, can you please expand on the meaning of "mature tissue"? Is the distinction between pre- and post-natal tissue, or before and after reaching final size (aka adulthood), or something else? This might be extra important when considering the brain.

written 17 days ago by chen.leibson 10

The transcriptional 'programme' will differ depending on the cell cycle, tissue, and stage of development. In the brain, for example, we would make a distinction between mature and other astrocytes.

written 16 days ago by Kevin Blighe 65k

Thank you! This is very helpful. I'll try to do this on my data as well.

written 17 days ago by chen.leibson 10
lieven.sterck 8.5k (VIB, Ghent, Belgium) wrote, 18 days ago:

This actually comes down to the fraction of the genome that is transcribed. For human, the ENCODE project is a good place to start for getting a number on this.

The total size of all regions that get transcribed from a genome is the upper limit. For biological reasons, not all potential transcripts are present in the cell at any given moment, so what you get from an RNA-seq alignment is usually lower than that (roughly 60-70% of it).

There are even more factors in play. For instance, rRNA is typically depleted before RNA-seq, so that is a fraction of the genome that is transcribed and present in your sample but barely visible, if at all, in the RNA-seq data.

Why do you want to know actually?

modified 15 days ago • written 18 days ago by lieven.sterck 8.5k

Thanks for your reply!

Just to make sure I understand you correctly - assuming about 90% of the genome gets transcribed, and the transcribed part of the genome from a single tissue (i.e. cortex in this case) will be ~60% of that - so ~54% of the genome, or about 1.62 billion bp?
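
The arithmetic in this estimate can be checked with a small sketch. The helper name `expected_bp` is made up for illustration; the inputs are the figures quoted in this thread (a 3 billion bp genome, the assumed 90% and 60% fractions, and the 7.5% measured in the brain sample), not new data.

```shell
#!/bin/sh
# expected_bp GENOME_BP FRACTION...
# Multiplies a genome size by a chain of fractions and prints the result in bp.
# Hypothetical helper for illustration only.
expected_bp() {
    awk 'BEGIN {
        bp = ARGV[1]
        for (i = 2; i < ARGC; i++) bp *= ARGV[i]   # apply each fraction in turn
        printf "%.0f\n", bp
    }' "$@"
}

expected_bp 3000000000 0.90 0.60   # prints 1620000000 (the 1.62 billion bp above)
expected_bp 3000000000 0.075       # prints 225000000 (7.5% of the genome)
```

The two results differ by roughly a factor of seven, which is why the choice of starting fraction matters so much for the population size.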

And after that, how much of it would you expect to be captured?

The reason I want to know this is for a sort of enrichment analysis: some of the positions in this "population" of transcribed bp have been found to have a role in splicing, and I'm trying to figure out the population size.

modified 18 days ago • written 18 days ago by chen.leibson 10

In theory yes, but the 90% is likely quite an over-estimate.

Perhaps my answer was a bit short, but then again, this is quite a debated topic and no fixed numbers are available. My own guesstimate would be that 10-20% of the genome might get transcribed into functional things (i.e. things you might pick up in an RNA-seq analysis).

I'm far from a statistician, but for some sort of enrichment analysis I would compare a 'positive' to a 'negative/neutral' sample and thus not work with single-sample data (but I could be wrong here).

written 17 days ago by lieven.sterck 8.5k

You are, of course, right. I am actually using many samples for the analysis - and was mostly looking to get a feeling for the magnitude of the population that should be expected.

written 17 days ago by chen.leibson 10

Keep in mind that life has no rules... in the cell, interactions between transcription factors, enhancers, promoters, TSS, and DNA are not judged by the letters ATGC - they are judged by electrochemical and electromagnetic interactions in the context of the 3-dimensional chromatin structure. Molecules that can promote transcription are undoubtedly binding virtually 'everywhere' they possibly can where there is an attraction, but binding only becomes sufficiently strong at certain loci such that a sustained transcription of an entire gene can occur.

Brain tissue has very specific transcriptional profiles, so, the figure of 7.5% is likely different in other bodily tissues.

written 16 days ago by Kevin Blighe 65k

"assuming about 90% of the genome gets transcribed, and the transcribed part of the genome from a single tissue (i.e. cortex in this case) will be ~60% of that - so ~54% of the genome, or about 1.62 billion bp?"

Not sure you can extrapolate that way. As I said before, there are several limiting steps. Your sample is a time slice (whatever happened to be expressed at that time point); that is the first limit. The efficiency of library preparation and what got captured in the library (it is not possible/feasible to convert 100% of the RNA you have into a library) is the second limit. The depth of sequencing used to sample the library (cost and diminishing-returns considerations) is the third limit. So you have losses at each step.

You could do an approximate calculation taking a standard "known" transcriptome and working backwards from your data to see what % you were able to recover.
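
One way to sketch that backwards calculation, under stated assumptions: take the annotated exons as a BED file (the name `exons.bed` is hypothetical), merge overlapping intervals so that exons shared between transcripts are counted once, and divide the number of covered annotated bases by that total. The helpers `merged_bp` and `recovered_pct` are illustrative names, not existing tools.

```shell
#!/bin/sh
# merged_bp: reads BED intervals (chrom, start, end; 0-based, half-open) on
# stdin and prints the total number of bases after merging overlaps.
merged_bp() {
    sort -k1,1 -k2,2n | awk '
        $1 != chrom || $2 > end {              # new chromosome or a gap
            if (chrom != "") total += end - start
            chrom = $1; start = $2; end = $3   # open a new block
            next
        }
        $3 > end { end = $3 }                  # overlap: extend the open block
        END {
            if (chrom != "") total += end - start
            printf "%.0f\n", total
        }'
}

# recovered_pct COVERED_BP ANNOTATED_BP: percentage of annotated bases recovered.
recovered_pct() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'
}

# Intended use (assumes samtools, data.bam and a hypothetical exons.bed;
# how overlapping BED regions are reported has not been checked here):
#   annotated=$(merged_bp < exons.bed)
#   covered=$(samtools depth -b exons.bed data.bam | awk '$3 > 0' | wc -l)
#   recovered_pct "$covered" "$annotated"
```

The result describes recovery of the annotation only; transcription outside annotated exons would not be counted by this approach.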

modified 17 days ago • written 17 days ago by genomax 89k