What does the number at the beginning of some lines of a transcriptome GTF file mean?
1
0
8.3 years ago
jack ▴ 960

Hi,

I need to parse the human GTF file for my work. I downloaded it from Ensembl.

Basically, I don't know what the number "1" at the beginning of some lines means.

Also, if one gene codes for more than one transcript, how can I find that out?

Here are the first few lines of the file:

1    havana    gene    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1    havana    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript";
1    havana    exon    11869    12227    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1";
1    havana    exon    12613    12721    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1";

genome-sequencing RNA-Seq Assembly genomics • 3.0k views
1
8.3 years ago

1 is the chromosome (you'll also see 2, 3, 4, ... X, Y, MT and so on). To find genes that code for multiple transcripts, use a regex to extract the gene_id and the transcript_id. If a given gene_id has more than one transcript_id, then you have your answer.
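
As a rough illustration of the regex approach, here is a minimal Python sketch. It assumes the attribute column looks like the lines in the question (`gene_id "..."; transcript_id "...";`); the function names are made up for this example and are not part of any library.

```python
import re
from collections import defaultdict

# Patterns for the two attributes, as they appear in the Ensembl GTF lines above.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_per_gene(lines):
    """Map each gene_id to the set of transcript_ids seen with it."""
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        gene = GENE_ID.search(line)
        transcript = TRANSCRIPT_ID.search(line)
        # "gene" feature lines carry no transcript_id, so they are skipped here.
        if gene and transcript:
            genes[gene.group(1)].add(transcript.group(1))
    return genes


def multi_transcript_genes(lines):
    """Keep only the genes that have more than one transcript_id."""
    return {
        gene: sorted(ids)
        for gene, ids in transcripts_per_gene(lines).items()
        if len(ids) > 1
    }


# Hypothetical usage (the file name is an assumption):
# with open("annotation.gtf") as handle:
#     print(multi_transcript_genes(handle))
```

Exon lines repeat the same transcript_id many times, which is why the IDs are collected in a set before counting.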

0

I see, but it seems a bit messy. I want to read this file to create a table with three columns: gene name, corresponding transcript, and transcription start site. But I can't make sense of this file.
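
For reference, the three-column table described here could be sketched directly from the GTF roughly as follows. This is only a sketch under some assumptions: the file is tab-delimited with nine columns, "transcript" feature lines are used, and the transcription start site is taken as the start coordinate on the + strand and the end coordinate on the - strand. The function name `tss_rows` is made up for this example.

```python
import re

# Matches key "value" pairs in the ninth (attribute) column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def tss_rows(lines):
    """Yield (gene_name, transcript_id, tss) for every transcript line."""
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        # Assumption: TSS is the 5' end of the transcript, so it depends on strand.
        tss = start if strand == "+" else end
        # Fall back to gene_id when a line has no gene_name attribute.
        gene = attrs.get("gene_name", attrs.get("gene_id", ""))
        yield gene, attrs.get("transcript_id", ""), tss


# Hypothetical usage (the file name is an assumption):
# with open("annotation.gtf") as handle:
#     for gene, transcript, tss in tss_rows(handle):
#         print(gene, transcript, tss, sep="\t")
```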

2

Welcome to bioinformatics - we feel your pain, this is your support group

1

There's no reason to bother parsing a GTF file for that; just use Biomart. I already created the query for you, so just click on the results on that link and save them to a file.

1

N.B., that link in my last comment had the gene_id rather than the name. Just switch that in the "attributes" section and then get the results of that. Mea culpa!

0

I got it, thanks. The other thing I want to do is add the 3' UTR sequence of each transcript to the file that I export from Biomart. How can I do this? I looked through the attributes, but I couldn't find such an option.

1

I found it, and done :-)

1

I should have refreshed before replying :)

1

When you extract sequences, you get the results in a FASTA file. It places the other attributes you wanted (e.g., transcript ID) in the header of each record and separates multiple attributes with a pipe ("|"). That's convenient enough to parse (really easy in Biopython) if you really do want things in columns.
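
To make the header layout concrete, here is a small standard-library sketch that splits each header on the pipe and joins the sequence lines. It is not the Biopython route mentioned above, just a dependency-free stand-in; the function name `read_fasta` is made up, and the order of the fields in the header is an assumption that depends on which attributes you selected.

```python
def read_fasta(lines):
    """Yield (header_fields, sequence) for each FASTA record.

    header_fields is the header line (without ">") split on "|".
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header.split("|"), "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:  # emit the last record
        yield header.split("|"), "".join(chunks)


# Hypothetical usage (file name and field order are assumptions):
# with open("mart_export.txt") as handle:
#     for fields, sequence in read_fasta(handle):
#         print(*fields, sequence, sep="\t")
```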

0

I see, but how can I find the transcripts of a given gene that have different 3' UTR sequences? For example, if a given gene encodes 4 transcripts and 2 of them have different 3' UTR sequences, I want to group together those two transcripts. I want to do this for the whole transcriptome.

1

Either post-process the biomart output to aggregate sequences by gene and compare them or go back to parsing the GTF file and compare the coordinates of everything past the stop codon (be sure to take strand into account!). I expect that the former is easier.
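
The first option (aggregate by gene, then compare) might look something like the sketch below. It assumes you already have (gene ID, transcript ID, 3' UTR sequence) triples from the export, compares sequences as plain strings, and skips empty entries or ones marked "Sequence unavailable" (an assumption about how missing UTRs appear in the export). The function name `group_by_utr` is made up for this example.

```python
from collections import defaultdict


def group_by_utr(records):
    """Group transcripts by gene and by 3' UTR sequence.

    records: iterable of (gene_id, transcript_id, utr_sequence).
    Returns {gene_id: {utr_sequence: [transcript_ids]}} for genes that
    have more than one distinct 3' UTR sequence.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, transcript_id, utr in records:
        utr = utr.upper()
        # Assumption: transcripts without a UTR show up as an empty string
        # or as a "Sequence unavailable" placeholder.
        if not utr or utr.startswith("SEQUENCE UNAVAILABLE"):
            continue
        by_gene[gene_id][utr].append(transcript_id)
    return {
        gene: dict(utrs) for gene, utrs in by_gene.items() if len(utrs) > 1
    }
```

Transcripts that share an identical UTR end up in the same list, so each gene in the result shows which transcripts differ from which.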

0

What you say is too general. The point is: how can I compare the sequences? Also, if the transcripts of a given gene have different coordinates, does that necessarily imply that they have different 3' UTR sequences?

0
1. I'm too busy at the moment to write some code to do this for you. You'll need to do that yourself.
2. Jein, to steal a word from German. The only reasonable way for the coordinates to be different is if there's some difference in the underlying sequence. While it's possible that a UTR region is repeated exactly, so that two different transcripts share the same UTR sequence but derive it from different genomic positions, the odds of that occurring are near 0. It would be interesting to check this, though.