Question: What does the number at the beginning of some lines of a transcriptome GTF file mean?
jack750 wrote:

Hi,

 

I need to parse the human GTF file for my work. I downloaded it from Ensembl.

Basically, I don't know what the number "1" at the beginning of some lines means.

Also, if one gene codes for more than one transcript, how can I find that out?

Here are the first few lines of it:

1    havana    gene    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1    havana    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript";
1    havana    exon    11869    12227    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1";
1    havana    exon    12613    12721    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1";

modified 2.5 years ago by Biostar ♦♦ 20 • written 4.5 years ago by jack750
Devon Ryan88k wrote:

1 is the chromosome (you'll also see 2, 3, 4, ... X, Y, MT and so on). To find genes that code for multiple transcripts, use a regex to extract the gene_id and the transcript_id. If a given gene_id has more than one transcript_id, then you have your answer.
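
A minimal sketch of that regex approach, in case it helps. It assumes the tab-separated, 9-column layout and the `gene_id "..."` / `transcript_id "..."` attribute style shown in the lines quoted in the question; the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Attribute patterns as they appear in the GTF lines quoted in the question.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_per_gene(lines):
    """Map each gene_id to the set of transcript_ids seen on 'transcript' lines."""
    genes = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only one line per transcript is needed
        gene = GENE_ID.search(fields[8])
        transcript = TRANSCRIPT_ID.search(fields[8])
        if gene and transcript:
            genes[gene.group(1)].add(transcript.group(1))
    return genes


def multi_transcript_genes(lines):
    """Keep only the genes that have more than one transcript."""
    return {g: t for g, t in transcripts_per_gene(lines).items() if len(t) > 1}
```

To run it on a real file, pass an open file handle, e.g. `multi_transcript_genes(open("annotation.gtf"))`, where `annotation.gtf` is a placeholder for whatever you downloaded.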

written 4.5 years ago by Devon Ryan88k

I see, but it seems a bit messy. I want to read this file to create a table with 3 columns: gene name, corresponding transcript, and transcription start site. But I can't make sense of this file.

written 4.5 years ago by jack750

welcome to bioinformatics - we feel your pain, this is your support group 

written 4.5 years ago by Istvan Albert ♦♦ 79k

There's no reason to bother parsing a GTF file for that, just use Biomart. Just click on the results on that link and save them to a file, since I already created the query for you.

modified 4.5 years ago • written 4.5 years ago by Devon Ryan88k

N.B., that link in my last comment had the gene_id rather than the name. Just switch that in the "attributes" section and then get the results of that. Mea culpa!

written 4.5 years ago by Devon Ryan88k

I got it, thanks. The other thing I'm looking for: I want to add the 3' UTR sequence of each transcript to the file that I export from Biomart. How can I do this? I looked through the attributes, but I couldn't find such an option.

written 4.5 years ago by jack750

I found it, and done :-)

written 4.5 years ago by jack750

I should have refreshed before replying :)

written 4.5 years ago by Devon Ryan88k

When you extract sequences, you get the results in a FASTA file. Biomart places the other attributes you wanted (e.g., transcript ID) in the header line of each sequence and separates multiple attributes with a pipe ("|"). That's convenient enough to parse (really easy in Biopython) if you really do want things in columns.
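
For what it's worth, here is a rough sketch of that header parsing. It uses plain Python rather than Biopython so it has no dependencies, and it assumes only what is described above: one `>` header per sequence, with the attributes separated by `|`. The function names are hypothetical, and which attribute ends up in which column depends on what you selected in the query.

```python
def read_fasta(lines):
    """Yield (header_fields, sequence) pairs from FASTA text.

    header_fields is the header line (without '>') split on '|'.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:].split("|"), []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_rows(lines):
    """Turn each record into one tab-separated row: attributes, then sequence."""
    return ["\t".join(fields + [seq]) for fields, seq in read_fasta(lines)]
```

Writing the rows returned by `fasta_to_rows` to a file, one per line, gives the column layout asked about earlier, with the sequence as the last column.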

written 4.5 years ago by Devon Ryan88k

I see, but how can I find the transcripts of a given gene that have different 3' UTR sequences? For example, if a given gene encodes 4 transcripts and 2 of them have different 3' UTR sequences, I want to group together those two transcripts that have different 3' UTR sequences. I want to do this for the whole transcriptome.

written 4.5 years ago by jack750

Either post-process the biomart output to aggregate sequences by gene and compare them or go back to parsing the GTF file and compare the coordinates of everything past the stop codon (be sure to take strand into account!). I expect that the former is easier.
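
The first option could look roughly like the sketch below. It assumes you have already turned the export into `(gene_id, transcript_id, utr_sequence)` tuples; that record layout and the function names are assumptions for illustration, not anything Biomart gives you directly. Sequences are compared as exact strings, ignoring case.

```python
from collections import defaultdict


def group_utrs_by_gene(records):
    """records: iterable of (gene_id, transcript_id, utr_sequence) tuples.

    Returns {gene_id: {utr_sequence: [transcript_id, ...]}}.
    """
    genes = defaultdict(lambda: defaultdict(list))
    for gene_id, transcript_id, seq in records:
        genes[gene_id][seq.upper()].append(transcript_id)
    return genes


def genes_with_distinct_utrs(records):
    """Keep genes whose transcripts carry at least two different UTR sequences."""
    result = {}
    for gene_id, by_seq in group_utrs_by_gene(records).items():
        if len(by_seq) > 1:
            # One list of transcript IDs per distinct sequence.
            result[gene_id] = sorted(sorted(ids) for ids in by_seq.values())
    return result
```

Each gene in the result maps to groups of transcript IDs, where transcripts in the same group share an identical sequence and transcripts in different groups do not. Exact string comparison treats a one-base difference the same as a completely different UTR, so a smarter comparison may be needed depending on what "different" should mean for the analysis.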

written 4.5 years ago by Devon Ryan88k

What you say is too general. The point is, how can I compare the sequences? Also, if the transcripts of a given gene have different coordinates, does that necessarily imply that they have different 3' UTR sequences?
 

written 4.5 years ago by jack750

1. I'm too busy at the moment to write some code to do this for you. You'll need to do that yourself.

2. Jein, to steal a word from German. The only reasonable way for the coordinates to be different is if there's some difference in the underlying sequence. While it's possible that a UTR region is repeated exactly, so that two different transcripts share the same UTR sequence but derive it from different genomic positions, the odds of that occurring are near 0. It would be interesting to check this, though.

written 4.5 years ago by Devon Ryan88k