Detecting expression in intergenic regions
5.1 years ago
A. Domingues ★ 2.6k

Background

We observed some RNA-seq coverage in regions outside annotated genes; let's call them intergenic. This expression appears to be more pronounced in, or unique to, a particular condition.

Goal

Find those genomic regions that have higher coverage than expected from random noise alone, along with a read count value (expression). We are not looking to identify those regions at high resolution, but rather to get a broad overview in order to:

• test whether or not there is a trend for more intergenic expression in some conditions;

• intersect those expressed intergenic regions with other relevant genomic features.

Data

Paired-end, total RNA-seq, Vertebrate species, not human.

Possible strategies

A, naïve:

• divide the genome in windows (size?)
• remove regions containing genes +/- 5kb
• set background: randomly select X regions (e.g. 1000) with 100 permutations to find the distribution of the background. Define the cut-off as mean (or median) + 2*SD.
• Use cut-off to select intergenic regions with high expression. Merge those within 1kb.
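A minimal Python sketch of strategy A, to make the steps concrete. It assumes a single chromosome, half-open coordinates, and a list of per-window read counts obtained elsewhere (e.g. from a counting tool). All function names and defaults are hypothetical, and drawing the background from the intergenic windows themselves is one possible reading of the bullet above, not the only one.

```python
import random
import statistics

def make_windows(chrom_len, size):
    """Tile [0, chrom_len) into consecutive half-open windows."""
    return [(s, min(s + size, chrom_len)) for s in range(0, chrom_len, size)]

def is_intergenic(win, genes, flank=5000):
    """True if the window does not touch any gene extended by `flank` bp."""
    ws, we = win
    return all(we <= gs - flank or ws >= ge + flank for gs, ge in genes)

def background_cutoff(counts, n_regions=1000, n_perm=100, n_sd=2, seed=0):
    """Mean + n_sd * SD of window counts pooled over repeated random draws."""
    rng = random.Random(seed)
    k = min(n_regions, len(counts))
    sampled = []
    for _ in range(n_perm):
        sampled.extend(rng.sample(counts, k))
    return statistics.mean(sampled) + n_sd * statistics.pstdev(sampled)

def merge_close(windows, max_gap=1000):
    """Merge windows separated by at most `max_gap` bp."""
    merged = []
    for s, e in sorted(windows):
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

In use, one would keep the windows passing `is_intergenic`, compute the cut-off from their counts, select windows above it, and pass those to `merge_close`. Because intergenic counts are usually dominated by zeros, a median-based cut-off may behave very differently from a mean-based one.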

B, fancier following a histone mark-style approach:

• Use csaw to calculate coverage using sliding-window (size?)
• remove bins containing genes +/- 5kb
• median coverage across those bins used to filter "expressed regions" (I could also use a permutation approach here)
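The sliding-window idea in strategy B can be illustrated in plain Python. This is not csaw's API (csaw is an R/Bioconductor package); it is only a toy version of overlapping windows plus a median-based filter. The window width, step, and fold threshold are hypothetical placeholders.

```python
import statistics

def sliding_window_counts(read_starts, chrom_len, width=200, step=50):
    """Count reads whose start falls in each overlapping half-open window."""
    windows = []
    for s in range(0, max(chrom_len - width, 0) + 1, step):
        e = s + width
        n = sum(1 for p in read_starts if s <= p < e)
        windows.append((s, e, n))
    return windows

def filter_by_median(windows, fold=3.0):
    """Keep windows whose count exceeds `fold` times the median count."""
    med = statistics.median(n for _, _, n in windows)
    return [w for w in windows if w[2] > fold * med]
```

The counting loop is quadratic and only suitable for small examples. Note also that if most intergenic windows are empty, the median is zero and every non-empty window passes, which is one reason a permutation-based or larger-bin background may be preferable.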

Question(s)

1. Does either of the above options sound reasonable for what I am trying to accomplish?
2. Is there some detail missing?

For the window sizes I was thinking about using the average size of exons, since using the size of transcripts could lead to really large windows. Also, if the expression is "transcript-like", short exons - variable length intron - it could lead to large discrepancies in the average coverage and some regions might be missed.

RNA-Seq Bedtools genome • 2.2k views

Would DERfinder be of use to you?

Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region.

https://github.com/alyssafrazee/derfinder


It just might. A bit more involved than what I had in mind, but it could give extra useful information. I'll look into it, cheers.

2.7 years ago
A. Domingues ★ 2.6k

To answer my own question, I ended up using two complementary approaches:

1. The first was to simply count total intergenic reads per condition and compare those (normalizing to mapped reads). The picture was pretty obvious, with some conditions showing a lot more intergenic expression than others.

2. I also used derfinder to define expressed regions irrespective of annotation, and then used heuristics to select those of interest. We basically had a handful of regions manually annotated, and from those I could derive rules to select other regions of interest. In the end, the rules were stricter than the manual annotation.
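For the first approach, the normalization could look something like the sketch below. The function names are hypothetical and the read counts in the usage lines are made-up numbers for illustration only, not results from this experiment.

```python
def intergenic_per_million(intergenic_reads, mapped_reads):
    """Intergenic reads per million mapped reads for one sample."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return intergenic_reads * 1e6 / mapped_reads

def mean_by_condition(samples):
    """Average the normalized value over the replicates of each condition.

    `samples` maps a condition name to a list of
    (intergenic_reads, mapped_reads) tuples, one per replicate.
    """
    return {
        cond: sum(intergenic_per_million(i, m) for i, m in reps) / len(reps)
        for cond, reps in samples.items()
    }

# Made-up numbers, purely illustrative.
example = {
    "condition_A": [(50_000, 10_000_000), (60_000, 12_000_000)],
    "condition_B": [(400_000, 10_000_000), (450_000, 9_000_000)],
}
summary = mean_by_condition(example)
```

Normalizing to mapped reads keeps differences in sequencing depth from being mistaken for differences in intergenic expression.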

Anyway, derfinder was a great tool for this purpose, though it was a little fiddly, and there was a lot of trial and error, with visual inspection in IGV, to get the right mix of sensitivity and specificity for the expressed regions.
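For readers unfamiliar with annotation-free region calling, the general idea is to threshold per-base coverage and keep contiguous runs above the cut-off. The Python sketch below illustrates only that idea; it is not derfinder's API, and the `cutoff` and `min_len` parameters are hypothetical stand-ins for the sensitivity/specificity tuning mentioned above.

```python
def expressed_regions(coverage, cutoff, min_len=1):
    """Return half-open (start, end) runs where coverage > cutoff.

    `coverage` is a per-base depth list for one chromosome; runs shorter
    than `min_len` bases are discarded.
    """
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > cutoff:
            if start is None:
                start = pos          # a new run begins here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None             # the run has ended
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions
```

Raising `cutoff` or `min_len` trades sensitivity for specificity, which is the kind of adjustment that would then be checked by eye in a genome browser.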


Will do. It took a while to get the problem sorted tbh. In those two years there was a lot more sequencing done, a number of other approaches tested, and even derfinder took a while to get to grips with. Not to mention all the other pesky parallel projects getting in the way :)

5.1 years ago

I really wanted to give a good answer and this is the best that I can do. I appreciate your interest in this particular area of research, as it is an intriguing area.

If you process your RNA-seq data with the latest methods, then anything that makes the final list of transcripts could be viewed as already a good candidate. In a condition or state where there is active transcription, you will always find mRNAs deriving from intergenic space where genuine transcription has occurred - these will typically be at low levels and could be viewed as transcriptional 'noise'. For example, the oestrogen receptor alpha protein is a very potent transcriptional activator and binds to regions all across the genome, which results in mRNA transcription of target genes but also that of many 'novel' ncRNAs that are expressed at low levels. These may very well have no function but to occupy volume in the nucleus and then get digested quickly.

The key is to not just find these transcripts but to infer their functionality. Thus, you may be missing the point with your approach in that you are focusing too much on just identifying these transcripts when we've already moved onto the question of 'What are these transcripts doing and what role do they have in disease?'

Thus, I would implore you to consider a new batch of questions:

• to which coding mRNAs are these transcripts statistically significantly correlated?
• do these transcripts overlap with known enhancer regions or overlap with other histone-binding sites?
• does the expression of these transcripts alter based on the genotype of nearby SNPs?
• are these transcripts expressed from the sense or antisense strand? - if antisense, do they interfere with the transcription of the sense mRNA?
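As an illustration of the overlap question in the list above, here is a minimal Python sketch that reports which features (for example, enhancers) each region overlaps. The tuple layouts and function names are assumptions made for this example; for real data a dedicated tool such as bedtools intersect would be the more practical choice.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def annotate_regions(regions, features):
    """Map each (chrom, start, end) region to the names of overlapping features.

    `features` is a list of (chrom, start, end, name) tuples.
    """
    hits = {}
    for chrom, start, end in regions:
        hits[(chrom, start, end)] = [
            name
            for f_chrom, f_start, f_end, name in features
            if f_chrom == chrom and overlaps(start, end, f_start, f_end)
        ]
    return hits
```

Adding a strand field to both tuples and comparing it in the same condition would extend this to the sense/antisense question.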

I don't see anything majorly incorrect about your approach but it looks more like a ChIP-seq experimental set-up, in which case you may also consider exploring deepTools. Following your approach, it would possibly give a generalised view of transcription across the genome, and, following the ChIP-seq idea, you could identify large regions that are significantly highly expressed over others. I'm not sure what this would add overall to what we already know, though.

As you've got 'total' RNA-seq, also be aware that ribosomal RNA will dominate your signal unless it was depleted during library preparation.

For the window sizes I was thinking about using the average size of exons, since using the size of transcripts could lead to really large windows. Also, if the expression is "transcript-like", short exons - variable length intron - it could lead to large discrepancies in the average coverage and some regions might be missed.

You would be criticised for that. Then again, people criticise everything. Many ncRNAs are single-exon and can range in size from a dozen bases up to a few kilobases.


Thank you for the insightful comments/suggestions, Kevin. Most of the questions about functionality were already on my follow-up list (and then some), and the ones that aren't are missing because they are not relevant to our model (neither human nor mammalian). We also have a pretty good biological hypothesis; in fact, this experiment was designed to partially answer it, so this is not a fishing expedition. Since transcriptional noise in the organism and biological process we are studying is not well characterized, if at all, we first need to identify those regions before we can move on to the follow-up questions addressing functionality. Even if such a list were available, it would be of transcribed regions in a particular tissue/condition, and I am not sure of its relevance for any given hypothesis in an unrelated tissue/disease. I assume finding these regions is not trivial because an extensive search did not reveal how to do it. Do you know of any tips or papers that address the original question? It would be really helpful.

If you process your RNA-seq data with the latest methods, then anything that makes the final list of transcripts could be viewed as already a good candidate.

Do you mean _de novo_ transcriptome assembly to find those new regions?

You would be criticised for that.

For using small windows and then merging them if they are "close" to each other? I am trying to approach this without assuming anything about the class or structure of the "genes" we will find. My guess is that these transcriptional units will not have very stable limits/structure as in conventional genes, and that is also why I just want broad regions rather than trying to identify transcript start/end with high resolution.


Hey, yes, I admit that I had answered your question taking the view that it's human or some other species that has already been characterised. If it has not yet been characterised, then it makes it more interesting, of course.

Identifying transcription from these regions can be done through de novo transcriptome assembly using HISAT2 and then StringTie (previously it was TopHat2 and Cufflinks). If there already exists a reference genome for this species, then great (it seems like there is already a well-defined genome and coding transcriptome).

My guess is that these transcriptional units will not have very stable limits/structure as in conventional genes, and that is also why I just want broad regions rather than trying to identify transcript start/end with high resolution.

Yes, that makes sense. It would be great to follow-up the work with a ChIP experiment in order to see if markers of active transcription overlap with your regions.

Could be a very great publication!


Yes, that makes sense. It would be great to follow-up the work with a ChIP experiment in order to see if markers of active transcription overlap with your regions.

ChIP-seq is very difficult because we are dealing with a small number of cells (~1000), a sub-population of cells during early embryogenesis. We do have ATAC-seq, though, to at least see regions that could be available for transcription. It didn't work very well on our first try (very low mappability), but that is an issue for another post.

Could be a very great publication!

We are also hoping for that :)