What does the number at the beginning of some lines of a transcriptome GTF file mean?
8.0 years ago
jack ▴ 940

Hi,

I need to parse human GTF file for my work. I downloaded it from Ensembl.

Basically, I don't know what the number "1" at the beginning of some lines means.

Also, if one gene codes for more than one transcript, how can I find that out?

Here are the first few lines of it:

1    havana    gene    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1    havana    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript";
1    havana    exon    11869    12227    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1";
1    havana    exon    12613    12721    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1";
genome-sequencing RNA-Seq Assembly genomics • 2.8k views
8.0 years ago

1 is the chromosome (you'll also see 2, 3, 4, ... X, Y, MT and so on). To find genes that code for multiple transcripts, use a regex to extract the gene_id and the transcript_id. If a given gene_id has more than one transcript_id, then you have your answer.
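The regex approach described above can be sketched roughly as follows. This is a minimal sketch, not a definitive parser: it assumes the file is tab-separated with the attributes in the ninth column (as in the Ensembl lines quoted in the question), and the function names `transcripts_per_gene` and `multi_transcript_genes` are made up for illustration.

```python
import re
from collections import defaultdict

# Patterns for the two attributes of interest in the 9th GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_per_gene(gtf_lines):
    """Map each gene_id to the set of transcript_ids found on 'transcript' lines."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the chromosome, column 3 the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            genes[gene.group(1)].add(transcript.group(1))
    return genes


def multi_transcript_genes(gtf_lines):
    """Keep only the genes that have more than one transcript_id."""
    return {
        gene: sorted(transcripts)
        for gene, transcripts in transcripts_per_gene(gtf_lines).items()
        if len(transcripts) > 1
    }
```

Passing an open file handle for the downloaded GTF to `multi_transcript_genes` should work, since the function only iterates over lines.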

I see, but it seems a bit messy. I want to read this file to create a table with three columns: gene name, corresponding transcript, and transcription start site. But I can't make sense of this file.
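For reference, such a three-column table can be built from the "transcript" lines of the GTF alone. The sketch below is one possible way to do it, under these assumptions: the file is tab-separated, the transcription start site is taken as the start coordinate on the "+" strand and the end coordinate on the "-" strand, and `tss_table` is a hypothetical helper name.

```python
import re

# Matches key "value" pairs in the 9th GTF column, e.g. gene_name "DDX11L1"
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def tss_table(gtf_lines):
    """Return (gene name, transcript ID, TSS) tuples, one per transcript line."""
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        # Assumption: transcription begins at the 5' end of the transcript,
        # which is the start on "+" and the end on "-".
        tss = start if strand == "+" else end
        # Fall back to gene_id when no gene_name attribute is present.
        name = attrs.get("gene_name", attrs.get("gene_id"))
        rows.append((name, attrs.get("transcript_id"), tss))
    return rows
```

The resulting tuples can be written out with the standard `csv` module if a file is needed.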

Welcome to bioinformatics - we feel your pain, this is your support group

There's no reason to bother parsing a GTF file for that; just use Biomart. I already created the query for you, so just click on the results at that link and save them to a file.

N.B., that link in my last comment had the gene_id rather than the name. Just switch that in the "attributes" section and then get the results of that. Mea culpa!

I got it, thanks. The other thing I'm looking for is a way to add the 3' UTR sequence of each transcript to the file that I export from Biomart. How can I do this? I looked through the attributes, but I couldn't find such an option.

I found it, and done :-)

I should have refreshed before replying :)

When you extract sequences, you get the results in a FASTA file. It places the other attributes you wanted (e.g., transcript ID) in the header line of each record and separates multiple attributes with a pipe ("|"). That's convenient enough to parse (really easy in Biopython) if you really do want things in columns.
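To illustrate the pipe-delimited headers, here is a minimal sketch that turns such a FASTA file into rows. It deliberately uses plain Python rather than Biopython so that it has no dependencies. The order of the header fields depends on the attributes selected in the export, so the caller supplies the field names; `read_fasta` and `fasta_to_rows` are hypothetical helper names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines, field_names):
    """Split each header on '|' and pair the pieces with field_names."""
    rows = []
    for header, sequence in read_fasta(lines):
        row = dict(zip(field_names, header.split("|")))
        row["sequence"] = sequence
        rows.append(row)
    return rows
```

For example, if gene ID and transcript ID were the chosen attributes, in that order, `fasta_to_rows(handle, ["gene_id", "transcript_id"])` would give one dictionary per transcript.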

I see, but how can I find the transcripts of a given gene that have different 3' UTR sequences? For example, if a given gene encodes 4 transcripts and 2 of them have different 3' UTR sequences, I want to group together the two transcripts that have different 3' UTR sequences. I want to do this for the whole transcriptome.

Either post-process the biomart output to aggregate sequences by gene and compare them or go back to parsing the GTF file and compare the coordinates of everything past the stop codon (be sure to take strand into account!). I expect that the former is easier.
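The first option, aggregating sequences by gene and comparing them, could look roughly like the sketch below. It assumes the export has already been parsed into (gene ID, transcript ID, 3' UTR sequence) tuples, that comparing sequences means exact string equality, and that transcripts without a UTR carry a placeholder string (assumed here to be "Sequence unavailable"; adjust it to whatever your export contains). `utr_variants_by_gene` is a hypothetical name.

```python
from collections import defaultdict


def utr_variants_by_gene(records, placeholder="Sequence unavailable"):
    """Group transcripts by gene and by identical 3' UTR sequence.

    records: iterable of (gene_id, transcript_id, sequence) tuples.
    Returns only genes whose transcripts have more than one distinct UTR,
    as {gene_id: {sequence: [transcript_id, ...]}}.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, transcript_id, sequence in records:
        # Skip transcripts with no usable UTR sequence (assumed placeholder).
        if not sequence or sequence == placeholder:
            continue
        by_gene[gene_id][sequence.upper()].append(transcript_id)
    return {
        gene: {seq: sorted(txs) for seq, txs in groups.items()}
        for gene, groups in by_gene.items()
        if len(groups) > 1  # more than one distinct sequence for this gene
    }
```

Transcripts listed under the same sequence share an identical 3' UTR; transcripts listed under different sequences of the same gene are the ones that differ.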

What you say is too general. The point is, how can I compare the sequences? Also, if the transcripts of a given gene have different coordinates, does that necessarily imply that they have different 3' UTR sequences?

  1. I'm too busy at the moment to write some code to do this for you. You'll need to do that yourself.
  2. Jein ("yes and no"), to steal a word from German. The only reasonable way for the coordinates to be different is if there's some difference in the underlying sequence. While it's possible that a UTR region is repeated exactly and therefore two different transcripts share the same UTR sequence but derive it from different genomic positions, the odds of that occurring are near 0. It would be interesting to check this, though.