Question: Detecting expression in intergenic regions
Asked 22 months ago by A. Domingues (Dresden, Germany):

Background

We observed some RNA-seq coverage in regions outside annotated genes; let's call them intergenic. This expression appears to be more pronounced in, or unique to, a particular condition.

Goal

Find those genomic regions that have higher coverage than expected from random noise alone, along with a read count value (expression). We are not looking to identify those regions with high resolution, but rather to get a broad overview in order to:

  • test whether or not there is a trend for more intergenic expression in some conditions;

  • intersect those expressed intergenic regions with other relevant genomic features.
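The second goal, intersecting expressed regions with other features, can be sketched as a plain interval-overlap check. This is only an illustration, assuming half-open (start, end) coordinates on a single chromosome; the function name `overlaps` is hypothetical, and for real data a tool such as bedtools would be the usual choice.

```python
def overlaps(regions, features):
    """Map each region to the features it overlaps.

    Assumes half-open (start, end) intervals on one chromosome.
    """
    feats = sorted(features)
    hits = {}
    for r_start, r_end in regions:
        # Two half-open intervals overlap when each starts before the other ends.
        hits[(r_start, r_end)] = [
            (f_start, f_end)
            for f_start, f_end in feats
            if f_start < r_end and r_start < f_end
        ]
    return hits
```

The comparison is quadratic in the number of intervals, which is fine for a toy example but would be too slow genome-wide.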

Data

Paired-end, total RNA-seq, Vertebrate species, not human.

Possible strategies

A, naïve:

  • divide the genome into windows (size?)
  • count reads per window
  • remove regions containing genes +/- 5kb
  • set background: randomly select X regions (e.g. 1000), repeated over 100 permutations, to estimate the background distribution. Define the cut-off as mean (or median) + 2*SD.
  • Use cut-off to select intergenic regions with high expression. Merge those within 1kb.
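Strategy A could look roughly like the sketch below. It is one possible reading of the steps, not a tested pipeline: it assumes a single chromosome, 0-based read start positions, and genes given as (start, end) pairs. The cut-off is taken here as the average of mean + 2*SD over repeated random samples of intergenic windows, which is an assumption about how the permutations would be used. All function names are hypothetical.

```python
import random
import statistics

def window_counts(read_starts, chrom_len, win):
    """Count reads per fixed, non-overlapping window (0 <= pos < chrom_len)."""
    counts = [0] * ((chrom_len + win - 1) // win)
    for pos in read_starts:
        counts[pos // win] += 1
    return counts

def intergenic_windows(n_windows, win, genes, flank):
    """Indices of windows that do not overlap any gene extended by `flank`."""
    keep = []
    for i in range(n_windows):
        w_start, w_end = i * win, (i + 1) * win
        if not any(w_start < g_end + flank and g_start - flank < w_end
                   for g_start, g_end in genes):
            keep.append(i)
    return keep

def background_cutoff(values, n_sample, n_perm, rng):
    """Average of (mean + 2*SD) over `n_perm` random samples of window counts."""
    cutoffs = []
    for _ in range(n_perm):
        sample = rng.sample(values, min(n_sample, len(values)))
        cutoffs.append(statistics.mean(sample) + 2 * statistics.stdev(sample))
    return statistics.mean(cutoffs)

def merge_windows(indices, win, max_gap):
    """Merge selected windows separated by at most `max_gap` bases."""
    regions = []
    for i in sorted(indices):
        start, end = i * win, (i + 1) * win
        if regions and start - regions[-1][1] <= max_gap:
            regions[-1][1] = end
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]
```

A typical call order would be `window_counts`, then `intergenic_windows`, then `background_cutoff` on the intergenic counts with `random.Random(seed)` for reproducibility, and finally `merge_windows` on the windows above the cut-off with `max_gap=1000`.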

B, fancier following a histone mark-style approach:

  • Use csaw to calculate coverage using sliding-window (size?)
  • remove bins containing genes +/- 5kb
  • median coverage across those bins used to filter "expressed regions" (I could also use a permutation approach here)
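The idea behind strategy B can be illustrated in a few lines. To be clear, csaw is an R/Bioconductor package and the code below does not use or mimic its API; it is a minimal sketch of overlapping-window counting followed by a median filter, assuming 0-based read start positions on one chromosome and that gene-overlapping windows have already been removed. The function names and the `fold` parameter are hypothetical.

```python
import statistics

def sliding_window_counts(read_starts, chrom_len, width, step):
    """Count read starts in each overlapping window [s, s + width)."""
    starts = sorted(read_starts)
    windows = []
    lo = hi = 0
    for s in range(0, max(chrom_len - width, 0) + 1, step):
        # Advance lo to the first read at or after the window start.
        while lo < len(starts) and starts[lo] < s:
            lo += 1
        # Advance hi to the first read at or after the window end.
        while hi < len(starts) and starts[hi] < s + width:
            hi += 1
        windows.append((s, s + width, hi - lo))
    return windows

def filter_by_median(windows, fold=1.0):
    """Keep windows whose count exceeds `fold` times the median count."""
    med = statistics.median([count for _, _, count in windows])
    return [w for w in windows if w[2] > fold * med]
```

With sparse intergenic signal the median is often zero, so in practice a higher `fold`, a minimum count, or the permutation cut-off mentioned above may be needed.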

Question(s)

  1. Do either of the above options sound reasonable for what I am trying to accomplish?
  2. Is there some detail missing?

For the window sizes I was thinking about using the average size of exons, since using the size of transcripts could lead to really large windows. Also, if the expression is "transcript-like", short exons - variable length intron - it could lead to large discrepancies in the average coverage and some regions might be missed.
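To pick a window size from exon lengths, something like the following sketch could be used. It assumes a standard tab-separated GTF with the feature type in column 3 and 1-based inclusive start/end in columns 4 and 5; `exon_lengths` is a hypothetical helper name.

```python
import statistics

def exon_lengths(gtf_lines):
    """Lengths of 'exon' features from GTF lines (1-based, inclusive)."""
    lengths = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[2] == "exon":
            lengths.append(int(fields[4]) - int(fields[3]) + 1)
    return lengths
```

Passing an open annotation file to `exon_lengths` and then taking `statistics.median` of the result gives a candidate window size; the median is less affected than the mean by a few very long exons.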

Tags: rna-seq, genome, bedtools

Would DERfinder be of use to you?

> Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region.

https://academic.oup.com/biostatistics/article/15/3/413/223630

https://github.com/alyssafrazee/derfinder
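The segmentation step described in that abstract can be illustrated with a toy example. This is not DERfinder's code or statistical model, only a sketch of the general idea of grouping contiguous bases whose per-base signal passes a threshold. The input `signal` could be any per-base statistic; the function name and the `min_len` parameter are hypothetical.

```python
def segment_regions(signal, cutoff, min_len=1):
    """Return half-open (start, end) runs where per-base signal > cutoff."""
    regions = []
    start = None
    for i, value in enumerate(signal):
        if value > cutoff and start is None:
            start = i  # a new run begins
        elif value <= cutoff and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(signal) - start >= min_len:
        regions.append((start, len(signal)))
    return regions
```

Assigning significance to each region, which is the core of the published method, is deliberately left out here.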

written 22 months ago by simon.vanheeringen

It just might. A bit more involved than what I had in mind, but it could give extra useful information. I'll look into it, cheers.

written 22 months ago by A. Domingues
Answer by Kevin Blighe, 22 months ago:

I really wanted to give a good answer and this is the best that I can do. I appreciate your interest in this particular area of research, as it is an intriguing area.

If you process your RNA-seq data with the latest methods, then anything that makes the final list of transcripts could be viewed as already a good candidate. In a condition or state where there is active transcription, you will always find mRNAs deriving from intergenic space where genuine transcription has occurred - these will typically be at low levels and could be viewed as transcriptional 'noise'. For example, the oestrogen receptor alpha protein is a very potent transcriptional activator and binds to regions all across the genome, which results in mRNA transcription of target genes but also that of many 'novel' ncRNAs that are expressed at low levels. These may very well have no function but to occupy volume in the nucleus and then get digested quickly.

The key is to not just find these transcripts but to infer their functionality. Thus, you may be missing the point with your approach in that you are focusing too much on just identifying these transcripts when we've already moved onto the question of 'What are these transcripts doing and what role do they have in disease?'

Thus, I would implore you to consider a new batch of questions:

  • to which coding mRNAs are these transcripts statistically significantly correlated?
  • do these transcripts overlap with known enhancer regions or overlap with other histone-binding sites?
  • does the expression of these transcripts alter based on the genotype of nearby SNPs?
  • are these transcripts expressed from the sense or antisense strand? - if antisense, do they interfere with the transcription of the sense mRNA?
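The first of these questions could be approached with a simple correlation screen, sketched below. It assumes expression values for one intergenic region and for each coding gene across the same ordered set of samples, uses plain Pearson correlation, and leaves out significance testing and multiple-testing correction entirely. The function names and the input layout are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def rank_genes_by_correlation(region_expr, gene_expr):
    """Rank genes by absolute correlation with one region's expression."""
    scores = [(gene, pearson(region_expr, values))
              for gene, values in gene_expr.items()]
    scores = [(gene, r) for gene, r in scores if not math.isnan(r)]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)
```

With only a handful of samples such correlations are very noisy, so any hits would be candidates for follow-up rather than conclusions.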

I don't see anything majorly incorrect about your approach but it looks more like a ChIP-seq experimental set-up, in which case you may also consider exploring deepTools. Following your approach, it would possibly give a generalised view of transcription across the genome, and, following the ChIP-seq idea, you could identify large regions that are significantly highly expressed over others. I'm not sure what this would add overall to what we already know, though.

As you've got 'total' RNA-seq, also be aware that ribosomal RNA will dominate your signal.

> For the window sizes I was thinking about using the average size of exons, since using the size of transcripts could lead to really large windows. Also, if the expression is "transcript-like", short exons - variable length intron - it could lead to large discrepancies in the average coverage and some regions might be missed.

You would be criticised for that. Then again, people criticise everything. Many ncRNAs are single-exon and can range in size from a dozen bases up to a few kilobases.


Thank you for the insightful comments/suggestions, Kevin. Most of the questions about functionality were already on my follow-up list (and then some), and the ones that aren't are missing because they are not relevant to our model (neither human nor mammalian). We also have a pretty good biological hypothesis; in fact, this experiment was designed to partially answer it, so this is not a fishing expedition. Since transcriptional noise in the organism and biological process we are studying is not well characterized, or not characterized at all, we first need to identify those regions before we can ask the follow-up questions addressing functionality. Even if such a list were available, it would be of transcribed regions in a particular tissue/condition, and I am not sure of its relevance for any given hypothesis in an unrelated tissue/disease. I assume finding these regions is not trivial because an extensive search did not reveal how to do it. Do you know of any tips or papers that address the original question? It would be really helpful.

> If you process your RNA-seq data with the latest methods, then anything that makes the final list of transcripts could be viewed as already a good candidate.

Do you mean _de novo_ transcriptome assembly to find those new regions?

> You would be criticised for that.

For using small windows and then merging them if they are "close" to each other? I am trying to approach this without assuming anything about the class or structure of the "genes" we will find. My guess is that these transcriptional units will not have very stable limits/structure as in conventional genes, and that is also why I just want broad regions rather than trying to identify transcript start/end with high resolution.

written 22 months ago by A. Domingues

Hey, yes, I admit that I had answered your question taking the view that it's human or some other species that has already been characterised. If it has not yet been characterised, then it makes it more interesting, of course.

Identifying transcription from these regions can be done through transcriptome assembly: align the reads with HISAT2 and then assemble transcripts with StringTie (previously it was TopHat2 and Cufflinks). Since this relies on a reference genome, it is strictly reference-guided rather than de novo assembly. If there already exists a reference genome for this species, then great (it seems like there is already a well-defined genome and coding transcriptome).

> My guess is that these transcriptional units will not have very stable limits/structure as in conventional genes, and that is also why I just want broad regions rather than trying to identify transcript start/end with high resolution.

Yes, that makes sense. It would be great to follow up the work with a ChIP experiment in order to see if markers of active transcription overlap with your regions.

Could be a very great publication!

written 22 months ago by Kevin Blighe

> Yes, that makes sense. It would be great to follow up the work with a ChIP experiment in order to see if markers of active transcription overlap with your regions.

ChIP-seq is very difficult because we are dealing with a small number of cells (~1000), a sub-population of cells during early embryogenesis. We do have ATAC-seq, though, to at least see regions that could be available for transcription. It didn't work very well in our first try (very low mappability), but that is an issue for another post.

> Could be a very great publication!

We are also hoping for that :)

written 22 months ago by A. Domingues