How to know the position of various transcripts of a given gene in the genomic sequence of that gene?
2
0
7.8 years ago
seta ★ 1.9k

Hi all friends,

I have a basic question, sorry for that. I am working on a human gene with multiple transcripts, and I want to know the position (start and stop points) of each transcript within the genomic sequence of the gene. Could you please tell me how I can do this?

Thank you

gene transcript sequence position • 4.8k views
ADD COMMENT
0

I assume you have a SAM file of the alignments? Here is the PDF describing the format: https://samtools.github.io/hts-specs/SAMv1.pdf

Column 4 (POS) holds the 1-based leftmost mapping position of the aligned read, and the length of the read can be obtained from the sequence in column 10 (SEQ). This should be enough information to filter all reads from a given position on a genome.
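A rough sketch of that filtering step in Python, assuming an uncompressed SAM text file and approximating the aligned span by the length of SEQ (a CIGAR-aware calculation would be needed for spliced or clipped reads). The function name `reads_in_region` and the example records are made up for illustration.

```python
def reads_in_region(sam_lines, chrom, region_start, region_end):
    """Yield (read_name, start, end) for reads overlapping a 1-based inclusive region.

    Assumption: the aligned span is approximated by len(SEQ), ignoring CIGAR.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        # POS == 0 means unmapped; SEQ == "*" means no sequence was stored
        if rname != chrom or pos == 0 or seq == "*":
            continue
        end = pos + len(seq) - 1
        if pos <= region_end and end >= region_start:
            yield fields[0], pos, end


# Hypothetical records, only to show the call
example_sam = [
    "@SQ\tSN:chr21\tLN:1000",
    "read1\t0\tchr21\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr21\t500\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
hits = list(reads_in_region(example_sam, "chr21", 105, 200))
```

For a real file you would pass an open file handle instead of the list, e.g. `reads_in_region(open("alignments.sam"), ...)`, where the file name is again just a placeholder.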

Hope that helps!

ADD REPLY
0

This is not providing an answer for the question asked in the original post. I have moved this post to a comment.

ADD REPLY
1
7.8 years ago

That information is contained in the annotation files (GTF or GFF) associated with the genome. For a single gene, it's probably easiest to use Ensembl's BioMart tool.
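If you go the annotation-file route, the transcript coordinates sit in the rows whose feature type is `transcript`. A minimal sketch, assuming an Ensembl-style GTF (tab-separated, nine columns, with `gene_id` and `transcript_id` in the attribute column); the function name and the example rows are hypothetical.

```python
import re


def transcripts_for_gene(gtf_lines, gene_id):
    """Return (transcript_id, chrom, start, end, strand) for each transcript of a gene."""
    results = []
    for line in gtf_lines:
        if line.startswith("#"):  # comment / header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        # Attribute column looks like: gene_id "X"; transcript_id "Y"; ...
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        if attrs.get("gene_id") == gene_id:
            results.append(
                (attrs.get("transcript_id"), cols[0], int(cols[3]), int(cols[4]), cols[6])
            )
    return results


# Made-up rows for illustration only
example_gtf = [
    '21\tsrc\tgene\t1000\t5000\t.\t-\t.\tgene_id "G1";',
    '21\tsrc\ttranscript\t1000\t4500\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
    '21\tsrc\ttranscript\t1200\t5000\t.\t-\t.\tgene_id "G1"; transcript_id "T2";',
    '21\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_id "G2"; transcript_id "T3";',
]
found = transcripts_for_gene(example_gtf, "G1")
```

GTF coordinates are 1-based and inclusive, and start is always the smaller number regardless of strand.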

ADD COMMENT
3

For a single gene just search for it in Ensembl and then click on "show transcript table". To see the start and end you have to click on the Ensembl Transcript ID in the table.

ADD REPLY
0

Much faster than my answer. Thumbs up!

PS: in the end, maybe not so fast, depending on the number of alternative transcripts that you have to click through one by one ^^

ADD REPLY
0

Much faster if one wants to gauge the complexity of the problem (are there a few transcripts or a hundred), but BioMart is the way to go if one needs to extract/use that information for something downstream :-)

ADD REPLY
0

A hundred transcripts? Not sure if that number has been annotated for any human gene ;) I know of DMD with 30 plus transcripts.

ADD REPLY
0

Ah, yeah, you can find gene information here too. The question was a little ambiguous, but if you are only interested in the positions of known or predicted genes for a genome, then the GTF is the way to go.

You can pull them from here: http://genome.ucsc.edu/cgi-bin/hgTables

ADD REPLY
0

These are the links to the Ensembl annotation in GTF and GFF3. BioMart is available as a web interface. You can also access it programmatically: from R (biomaRt), through the BioMart Perl API, and/or via BioMart RESTful access.
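For the RESTful route, the request is an XML query passed as a `query` parameter. The sketch below only builds the query and the URL, it does not send anything. The endpoint, the dataset name `hsapiens_gene_ensembl`, and the attribute/filter names are my assumptions about the Ensembl mart, so check them against the attribute list in the BioMart web interface before relying on them. The gene ID is the one quoted elsewhere in this thread.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Ensembl BioMart documentation
BIOMART_URL = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(gene_id):
    """Build a BioMart XML query for gene and transcript coordinates of one gene."""
    # Assumed attribute names for the Ensembl genes mart
    attributes = [
        "ensembl_gene_id",
        "ensembl_transcript_id",
        "start_position",    # gene start
        "end_position",      # gene end
        "transcript_start",
        "transcript_end",
        "strand",
    ]
    attr_xml = "".join(f'<Attribute name="{a}"/>' for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">'
        '<Dataset name="hsapiens_gene_ensembl" interface="default">'
        f'<Filter name="ensembl_gene_id" value="{gene_id}"/>'
        f"{attr_xml}</Dataset></Query>"
    )


query_xml = build_biomart_query("ENSG00000142192")
request_url = BIOMART_URL + "?" + urlencode({"query": query_xml})
```

Fetching `request_url` (for example with `urllib.request.urlopen`) should then return a tab-separated table, assuming the names above are still valid for the current Ensembl release.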

ADD REPLY
1
7.8 years ago

1) Get the data

  • Go to BioMart
  • Select Ensembl Genes as database, Homo sapiens as dataset
  • Then, in the left panel, click on Attributes, then on GENE, then tick gene start, gene end, transcript start, transcript end, gene strand.
  • In the left panel, click on Filters, then on GENE, then type your favorite gene ID in the appropriate field
  • At the top of the left panel, click on Results, then download.

2) Open it in Excel, R or whatever you like and subtract the gene START and END from the transcript START and END to get the position of each transcript relative to the full gene. Done.


PS: Note that the START position is the real start (i.e. the transcription start site) only if the strand is +. When the gene is on the minus strand, the END position is the TSS.
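Step 2 and the strand caveat can be sketched in Python as follows. This assumes the BioMart convention that strand is given as 1 or -1 and that coordinates are 1-based genomic positions; the function name and the numbers in the example are hypothetical.

```python
def relative_to_gene(gene_start, gene_end, tx_start, tx_end, strand):
    """Express a transcript's position relative to its gene.

    Returns genomic offsets from the gene START, the TSS coordinate, and
    strand-aware distances from the gene's 5' and 3' ends.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if strand == 1:
        tss = tx_start                      # START is the TSS on the + strand
        offset_5p = tx_start - gene_start
        offset_3p = gene_end - tx_end
    else:
        tss = tx_end                        # END is the TSS on the - strand
        offset_5p = gene_end - tx_end
        offset_3p = tx_start - gene_start
    return {
        "genomic_offsets": (tx_start - gene_start, tx_end - gene_start),
        "tss": tss,
        "offset_from_gene_5p": offset_5p,
        "offset_from_gene_3p": offset_3p,
    }


# Made-up coordinates for a minus-strand gene
example = relative_to_gene(1000, 5000, 1200, 4500, -1)
```

Here the transcript's TSS is at 4500, which is 500 bp inside the gene's 5' end (5000), while its 3' end is 200 bp inside the gene's 3' end (1000).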

ADD COMMENT
0

Thank you for your complete response. I did it, and all transcripts were on the minus strand. Just to make sure, given the information below, could you please tell me the position of "ENST00000346798" relative to the full gene?

Ensembl Gene ID  Ensembl Transcript ID  Gene Start (bp)  Gene End (bp)  Transcript Start (bp)  Transcript End (bp)  Strand
ENSG00000142192  ENST00000346798        25880550         26171128       25880550               26170654             -1

Thank you

ADD REPLY
1

What do you mean by "full gene"? The longest transcript? The most 5' and 3' transcript ends (which may be derived from different isoforms)? The most 5' and 3' exon endpoints?

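Each annotation contains information for one transcript (including 5' and 3' UTRs) and the corresponding coding sequence (gene). In your example, the transcript and the gene share the same lower coordinate (25880550), while the transcript's upper coordinate (26170654) lies 474 nucleotides (26171128 - 26170654) before the gene's (26171128). Because the gene is on the minus strand, the upper coordinate is the 5' end, so the transcript's start site sits 474 nt inside the gene's 5' boundary.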
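The arithmetic for the posted row can be checked with a few lines of Python. The variable names are mine; the coordinates and strand are the ones from the table above, and lengths assume 1-based inclusive coordinates.

```python
# Values copied from the BioMart row for ENST00000346798
gene_start, gene_end = 25880550, 26171128
tx_start, tx_end = 25880550, 26170654
strand = -1

# On the minus strand the 5' end is the larger genomic coordinate
if strand == -1:
    gene_5p, gene_3p = gene_end, gene_start
    tx_5p, tx_3p = tx_end, tx_start
else:
    gene_5p, gene_3p = gene_start, gene_end
    tx_5p, tx_3p = tx_start, tx_end

offset_5p = abs(gene_5p - tx_5p)         # distance between the two 5' ends
offset_3p = abs(gene_3p - tx_3p)         # distance between the two 3' ends
gene_length = gene_end - gene_start + 1  # inclusive span of the gene
tx_span = tx_end - tx_start + 1          # inclusive genomic span of the transcript

print(offset_5p, offset_3p, gene_length, tx_span)
```

This gives a 5' offset of 474 and a 3' offset of 0, with a gene span of 290579 bp and a transcript span of 290105 bp.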

ADD REPLY
0

I came to the same result with that gene and transcript.

I guess "full gene" means the span between the most 5' and most 3' transcript ends; at least that is my understanding of how genes are annotated.

ADD REPLY
0

What I meant was the corresponding (full) gene for the various transcripts. I just wanted to make sure about the 474 nt before the gene. Thank you for your help.

ADD REPLY
