High downstream gene expression
14 months ago
yoser4 ▴ 10

hello everyone.

I am a novice in bioinformatics. I want to ask some questions.

I have some RNA-seq data. I used the bamCoverage tool to convert the BAM file produced by STAR into bigWig format, and then loaded it into the UCSC Genome Browser. I found a long region downstream of the gene with high expression. My questions are:

  1. What is the bioinformatics explanation for this highly expressed sequence downstream of the gene?
  2. Is this highly expressed sequence some known kind of element? (Someone suggested it might be related to poly(A) tails, but I don't know whether that explains it.)
  3. How do I read the other tracks in the UCSC Genome Browser (such as GC Percent or Repeats)?

With regard to the above questions, can anyone recommend relevant articles or posts? I want to learn.

[image]

Any help will be appreciated!

14 months ago

You don't show which species this is. I've taken a quick look at the human and mouse genomes at those locations and neither has any genes around there. Looking at the pattern of expression, it looks to me like you have a series of short exons, followed by a long terminal exon, which you are marking as downstream of the gene.

Most genes in eukaryotes do have a long terminal exon, which includes the final part of the coding sequence as well as the 3' UTR. In humans, for example, the average UTR is as long as the coding sequence, and is mostly found in a single exon.

In less well studied genomes, the location of genes is often identified by aligning protein sequences from related organisms to the genome, and finding regions that could give rise to proteins of a similar sequence. This works pretty well, particularly when combined with gene prediction software working purely from sequence, and any EST/cDNA data that might exist for the species. However, while it works well for finding open reading frames/coding sequences (CDS), it works really badly, or not at all, for annotating UTRs. Often in such genomes the region that is annotated as the "gene" just spans from the start codon to the stop codon. But the expressed parts of genes span from the transcription start site to the transcription termination site, which might be several kb upstream and downstream of the start/stop codons.

In better annotated genomes, much effort has been put into identifying the UTR sequence by using lots of cDNA/RNA-seq/CAGE data, but such data just doesn't exist for non-model organisms. Even in humans it's only been in the last few years that the UTR annotations have been anything like reliable. Even then, we are coming to realise that UTRs are highly variable from cell type to cell type and condition to condition, and that existing annotations only cover a subset of the total possible variation.
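One way to check whether an annotation really stops at the stop codon is to compare the span of the exon features with the span of the CDS features for the gene in the GTF file. The following is only a minimal sketch: it assumes a standard nine-column GTF with `exon`, `CDS` and `stop_codon` features and a `gene_id` attribute, and the function names, the gene ID `G1` and the coordinates are made up for illustration.

```python
def noncoding_flanks(gtf_lines, gene_id):
    """Return how far the exons extend beyond the CDS on each side of a gene.

    Coordinates are genomic, so "left" is the 5' side for a plus-strand
    gene and the 3' side for a minus-strand gene.
    """
    needle = f'gene_id "{gene_id}"'
    exon_pos, cds_pos = [], []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or needle not in fields[8]:
            continue  # malformed line or a different gene
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        if feature == "exon":
            exon_pos += [start, end]
        elif feature in ("CDS", "stop_codon"):
            cds_pos += [start, end]
    if not exon_pos or not cds_pos:
        return None  # gene not found, or no coding features annotated
    return {
        "left_noncoding_bp": min(cds_pos) - min(exon_pos),
        "right_noncoding_bp": max(exon_pos) - max(cds_pos),
    }


def gtf_line(feature, start, end, gene_id="G1"):
    """Build one hypothetical plus-strand GTF record for the example."""
    cols = ["chrX", "example", feature, str(start), str(end), ".", "+", ".",
            f'gene_id "{gene_id}"; transcript_id "{gene_id}.1";']
    return "\t".join(cols)


# Hypothetical gene whose last exon ends exactly at the stop codon.
example = [
    gtf_line("exon", 100, 200),
    gtf_line("exon", 300, 400),
    gtf_line("CDS", 150, 200),
    gtf_line("CDS", 300, 397),
    gtf_line("stop_codon", 398, 400),
]
print(noncoding_flanks(example, "G1"))
```

For this made-up plus-strand gene the right-hand non-coding length is 0, meaning no 3' UTR is annotated at all. A result like that for a real gene would be consistent with the annotation ending at the stop codon while the actual transcript runs further.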

Thank you for your reply. Sorry, I left out some information. First of all, the species I study is sheep. Secondly, the left side of the red line I drew is indeed the gene region, while the right side of the red line is not in the gene region (only a small part of the last continuous peak is actually on the last exon of the gene, and that is where I placed the red line). To sum up, what I can confirm at present is that, in the reference genome version I use, the region to the right of the red line is not part of the gene, and there are no other annotated genes within a long distance downstream of the gene. Is that right? I have two main ideas: 1. This peak is a part of the gene that is not included in the reference annotation. 2. There is something else at this location.

I don't see genes in this location on the standard sheep genome either. The image you've included above doesn't show the gene annotation. But I'd bet good money that what you are seeing here is that the annotation of the gene stops at the stop codon, while in reality the gene/transcript goes on further than this.
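To put a rough number on how far the transcript continues past the annotated end, one can walk along the per-base coverage until the signal drops out. This is only a sketch under stated assumptions: a plus-strand gene, a plain Python list of per-base depths (for example, values read out of the bigWig file), and arbitrary illustrative thresholds `min_depth` and `max_gap`. The function name is hypothetical and this is not an established method.

```python
def downstream_extension(coverage, annotated_end, min_depth=5, max_gap=50):
    """Estimate how many bases of coverage continue past an annotated gene end.

    coverage      -- list of per-base read depths, indexed by 0-based position
    annotated_end -- index of the last annotated base of the gene
    min_depth     -- depth at or above which a base counts as covered (assumed)
    max_gap       -- longest uncovered stretch tolerated before stopping (assumed)
    """
    last_covered = annotated_end
    gap = 0
    for pos in range(annotated_end + 1, len(coverage)):
        if coverage[pos] >= min_depth:
            last_covered = pos
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                break  # signal has dropped out for too long
    return last_covered - annotated_end


# Made-up profile: 100 annotated bases, 300 covered bases beyond them, then nothing.
profile = [10] * 100 + [8] * 300 + [0] * 200
print(downstream_extension(profile, annotated_end=99))
```

A long extension that is continuous with the last annotated exon, at a similar depth, would fit the idea of an unannotated 3' UTR. A signal separated from the gene by a clear gap would point more towards a separate transcript.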

14 months ago
  1. Impossible to tell from the information you are showing. Could be misalignments, read-through transcription (detection pipeline) and some more weird stuff. This holds in particular because the chromosome in question is chrX: interpreting results of RNA-seq on the sex chromosomes warrants special caution, since there are e.g. several long non-coding RNAs like Xist that could give rise to spurious signals. Also mind that accurate assemblies of sex chromosomes are hard, so for many less studied organisms, the provided reference genome sequences should be taken with a grain of salt. Best to thoroughly check the scientific literature first. For some organisms, the UCSC Genome Browser has a literature track, which allows you to easily find scientific publications that mention a specific region or sequence.
  2. It is surely not a poly(A) tail. While polyadenylation is very relevant, the tail is added after transcription and is not encoded in the genome. Only some artificial expression vector systems encode the poly(A) tail itself genetically as part of the vector backbone.
  3. You can click on every track in the genome browser to obtain more information about it, e.g. the window size used to calculate the GC percent or the tools used to call and classify the repeats. To download the information, use the "Table Browser" in the "Tools" menu.
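As an illustration of what a windowed GC percent track contains, here is a minimal sketch that computes the percentage of G and C bases in consecutive fixed-size windows. The window size of 5 is an assumption chosen for the example (the track description page states the size actually used), and the function name and the sequence are hypothetical.

```python
def gc_percent_windows(sequence, window=5):
    """Return (window_start, GC percent) for consecutive non-overlapping windows.

    The last window may be shorter than `window` if the sequence length
    is not a multiple of the window size.
    """
    seq = sequence.upper()
    result = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        result.append((start, 100.0 * gc_count / len(chunk)))
    return result


# Made-up 15-base sequence split into three 5-base windows.
for start, gc in gc_percent_windows("GGCCAATATAGCGCG"):
    print(start, gc)
```

High values mark GC-rich stretches and low values mark AT-rich ones, which is what the height of the GC percent graph in the browser represents.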

Good luck!

Thank you for your reply. Your answer has helped me. I am new to scientific research, and my literature reading is still weak.

P.S. I am still looking for help here. If anyone has read literature on similar issues, please share it with me. Thank you veeeeeeeeeeeeeeeeeeeeery much.
