Question: How to know the position of various transcripts of a given gene in the genomic sequence of that gene?
seta (Sweden) wrote:

Hi all,

I have a basic question, sorry for it. I am working on a human gene with multiple transcripts, and I want to know the position (start and end points) of each transcript within the genomic sequence of the gene. Could you please tell me how I can do this?

Thank you

Tags: transcript, position, sequence, gene • 1.2k views
modified 2.2 years ago by Carlo Yague (4.2k) • written 2.2 years ago by seta (1.0k)

I assume you have a SAM file of the alignments? Here is the PDF of the format specification: https://samtools.github.io/hts-specs/SAMv1.pdf

Column 4 holds the position of the aligned read, and the length of the read can be obtained from column 10. This should be enough information to filter all reads from a given position on a genome.
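
A minimal sketch of the column 4 / column 10 idea above, in Python. It parses tab-separated SAM text directly; the example records and the `reads_overlapping` helper are made up for illustration. Taking the length of the sequence in column 10 as the span of the read ignores the CIGAR string, so spliced or clipped alignments would need extra handling.

```python
def reads_overlapping(sam_lines, chrom, start, end):
    """Yield (name, pos, read_end) for reads overlapping chrom:start-end (1-based, inclusive)."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, rname = fields[0], fields[2]
        pos = int(fields[3])   # column 4: 1-based leftmost mapping position
        seq = fields[9]        # column 10: read sequence
        if rname != chrom or pos == 0 or seq == "*":
            continue  # other chromosome, unmapped read, or sequence not stored
        read_end = pos + len(seq) - 1  # approximation: ignores the CIGAR string
        if pos <= end and read_end >= start:
            yield name, pos, read_end


# Hypothetical records, for illustration only.
example_sam = [
    "@SQ\tSN:chr21\tLN:1000",
    "read1\t0\tchr21\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr21\t500\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
hits = list(reads_overlapping(example_sam, "chr21", 105, 200))
print(hits)
```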

Hope that helps!

written 2.2 years ago by playerraa (0)

This is not providing an answer for the question asked in the original post. I have moved this post to a comment.

written 2.2 years ago by genomax (55k)

harold.smith.tarheel (United States) wrote:

That information is contained in the annotation files (GTF or GFF) associated with the genome. For a single gene, it's probably easiest to use Ensembl's BioMart tool.
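
For anyone who prefers to read the annotation file directly, here is a rough Python sketch of pulling transcript coordinates for one gene out of GTF text. It assumes the file has `transcript` feature lines carrying `gene_id` and `transcript_id` attributes (as Ensembl GTFs do); the helper name and the example records are hypothetical.

```python
import re


def transcripts_for_gene(gtf_lines, gene_id):
    """Return {transcript_id: (chrom, start, end, strand)} for one gene."""
    found = {}
    for line in gtf_lines:
        if line.startswith("#"):  # comment / header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # Column 9 holds key "value"; pairs.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        if attrs.get("gene_id") == gene_id:
            found[attrs["transcript_id"]] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return found


# Hypothetical records, for illustration only.
example_gtf = [
    "#!genome-build example",
    'chr1\tsrc\tgene\t1000\t5000\t.\t-\t.\tgene_id "GENE1";',
    'chr1\tsrc\ttranscript\t1000\t4500\t.\t-\t.\tgene_id "GENE1"; transcript_id "TX1";',
    'chr1\tsrc\ttranscript\t1200\t5000\t.\t-\t.\tgene_id "GENE1"; transcript_id "TX2";',
    'chr1\tsrc\ttranscript\t9000\t9500\t.\t+\t.\tgene_id "GENE2"; transcript_id "TX3";',
]
result = transcripts_for_gene(example_gtf, "GENE1")
print(result)
```

GTF coordinates are 1-based and inclusive, and start is always the smaller coordinate regardless of strand.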

modified 2.2 years ago • written 2.2 years ago by harold.smith.tarheel (4.2k)

For a single gene just search for it in Ensembl and then click on "show transcript table". To see the start and end you have to click on the Ensembl Transcript ID in the table.

written 2.2 years ago by genomax (55k)

Much faster than my answer. Thumbs up!

PS: in the end, it may not be so fast, depending on the number of alternative transcripts that you have to click through one by one ^^

modified 2.2 years ago • written 2.2 years ago by Carlo Yague (4.2k)

Much faster if one wants to gauge the complexity of the problem (are there a few transcripts or a hundred), but BioMart is the way to go if one needs to extract/use that information for something downstream :-)

written 2.2 years ago by genomax (55k)

A hundred transcripts? Not sure if that number has been annotated for any human gene ;) I know of DMD with 30 plus transcripts.

written 2.2 years ago by Denise - Open Targets (4.6k)

Ah, yeah, you can find gene information here too. The question was a little ambiguous, but if you are only interested in the positions of known or predicted genes for a genome, then the GTF is the way to go.

You can pull them from here: http://genome.ucsc.edu/cgi-bin/hgTables

written 2.2 years ago by playerraa (0)

The Ensembl annotation is available in GTF and GFF3 format. BioMart is available as a web interface. You can also access it programmatically through R (biomaRt), the BioMart Perl API and/or the BioMart RESTful access.
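
As a rough illustration of the RESTful route, the Python sketch below only builds a BioMart XML query and the request URL; it does not contact the server. The endpoint, the dataset name and the attribute names are written from memory and should be treated as assumptions to check against the BioMart web interface; `build_query` is a hypothetical helper. The gene ID is the one that comes up later in this thread.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint; verify it against the Ensembl BioMart help pages.
MART_URL = "http://www.ensembl.org/biomart/martservice"

# Assumed attribute names for gene/transcript coordinates.
ATTRIBUTES = [
    "ensembl_gene_id",
    "ensembl_transcript_id",
    "start_position",
    "end_position",
    "transcript_start",
    "transcript_end",
    "strand",
]


def build_query(gene_id, attributes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query filtering on one Ensembl gene ID."""
    attr_xml = "".join(f"<Attribute name={quoteattr(a)}/>" for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">'
        f'<Dataset name={quoteattr(dataset)} interface="default">'
        f'<Filter name="ensembl_gene_id" value={quoteattr(gene_id)}/>'
        f"{attr_xml}</Dataset></Query>"
    )


query_xml = build_query("ENSG00000142192", ATTRIBUTES)
request_url = MART_URL + "?" + urlencode({"query": query_xml})
print(request_url)
# Fetching would then be something like:
#   urllib.request.urlopen(request_url).read().decode()
```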

written 2.2 years ago by Denise - Open Targets (4.6k)

Carlo Yague (Belgium) wrote:

1) Get the data

  • Go to BioMart
  • Select Ensembl Genes as database, Homo sapiens as dataset
  • Then on the left panel click on Attributes, then click on GENE, then tick gene start, gene end, transcript start, transcript end, gene strand.
  • On the left panel click on Filter, then click on GENE, then type your favorite gene ID in the appropriate field
  • On the top left panel click on Results, then download.

2) Open it in Excel, R or whatever you like and subtract the gene START and END from the transcript START and END to get the position of the transcripts relative to the full gene. Done.


PS: Note that the START position is only the real start (i.e. the transcription start site) if the strand is +. When it is on the minus strand, the END position is the TSS.
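
Step 2 and the strand caveat can be sketched in Python as follows. This is only one way to do it: the rows are made-up numbers, the column headers are assumed and may differ between BioMart versions, and `relative_to_gene` is a hypothetical helper. Positions are reported 1-based, reading the gene 5' to 3'.

```python
import csv
import io


def relative_to_gene(gene_start, gene_end, tx_start, tx_end, strand):
    """Return 1-based (start, end) of a transcript within its gene, read 5'->3'."""
    if strand >= 0:
        # Plus strand: the gene's 5' end is its START coordinate.
        return tx_start - gene_start + 1, tx_end - gene_start + 1
    # Minus strand: the gene's 5' end is its END coordinate.
    return gene_end - tx_end + 1, gene_end - tx_start + 1


# Hypothetical BioMart-style export (tab-separated), for illustration only.
header = ["Ensembl Gene ID", "Ensembl Transcript ID", "Gene Start (bp)",
          "Gene End (bp)", "Transcript Start (bp)", "Transcript End (bp)", "Strand"]
rows = [
    ["GENE_A", "TX_A1", "1000", "5000", "1000", "4500", "-1"],
    ["GENE_A", "TX_A2", "1000", "5000", "1200", "5000", "-1"],
    ["GENE_B", "TX_B1", "100", "900", "150", "900", "1"],
]
export = "\n".join("\t".join(r) for r in [header] + rows)

positions = {}
for row in csv.DictReader(io.StringIO(export), delimiter="\t"):
    positions[row["Ensembl Transcript ID"]] = relative_to_gene(
        int(row["Gene Start (bp)"]),
        int(row["Gene End (bp)"]),
        int(row["Transcript Start (bp)"]),
        int(row["Transcript End (bp)"]),
        int(row["Strand"]),
    )
print(positions)
```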

written 2.2 years ago by Carlo Yague (4.2k)

Thank you for your complete response. I did it, and all transcripts were on the minus strand. Just to make sure, given the information below, could you please tell me the position of "ENST00000346798" relative to the full gene?

Ensembl Gene ID | Ensembl Transcript ID | Gene Start (bp) | Gene End (bp) | Transcript Start (bp) | Transcript End (bp) | Strand
ENSG00000142192 | ENST00000346798       | 25880550        | 26171128      | 25880550              | 26170654            | -1

Thank you

modified 2.2 years ago • written 2.2 years ago by seta (1.0k)

What do you mean by "full gene"? The longest transcript? The most 5' and 3' transcript ends (which may be derived from different isoforms)? The most 5' and 3' exon endpoints?

Each annotation contains information for one transcript (including 5' and 3' UTRs) and corresponding coding sequence (gene). In your example, which is on the minus strand, the transcript's 5' end (26170654) lies 474 nucleotides (26171128 - 26170654) from the gene's 5' end (26171128), and both end at the same position (25880550).
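
The arithmetic for this transcript can be double-checked with a few lines of Python. The coordinates are the ones posted above; the variable names are just for illustration, and the spans assume 1-based inclusive coordinates.

```python
gene_start, gene_end = 25880550, 26171128
tx_start, tx_end = 25880550, 26170654
strand = -1

# On the minus strand the 5' end is the larger genomic coordinate.
if strand < 0:
    five_prime_offset = gene_end - tx_end
    three_prime_offset = tx_start - gene_start
else:
    five_prime_offset = tx_start - gene_start
    three_prime_offset = gene_end - tx_end

gene_span = gene_end - gene_start + 1  # inclusive coordinates
tx_span = tx_end - tx_start + 1

print(five_prime_offset, three_prime_offset, gene_span, tx_span)
# prints: 474 0 290579 290105
```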

written 2.2 years ago by harold.smith.tarheel (4.2k)

I came to the same result with that gene and transcript.

I guess "full gene" means the span between the most 5' and most 3' transcript ends, or at least this is my understanding of how genes are annotated.

modified 2.2 years ago • written 2.2 years ago by Carlo Yague (4.2k)

I meant the corresponding (full) gene that the various transcripts belong to. I just wanted to make sure about the 474 nt offset from the gene. Thank you for your help.

written 2.2 years ago by seta (1.0k)