Question: Making gene annotation from a GTF file
Karyo (India) wrote, 5.5 years ago:

Hi, I have downloaded a GTF-formatted file from a database. As you know, it is a tab-delimited file with 9 columns, and it looks like this:


#RefSeq_name Source Feature Start End Score Strand Frame Attribute
Chromosome5 file_source start_codon 4470284 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source stop_codon 4469688 4469690 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470173 4470286 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470173 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4470034 4470120 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4470034 4470120 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source exon 4469273 4469969 . - . gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source CDS 4469691 4469969 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source start_codon 4455593 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source stop_codon 4453288 4453290 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455560 4455595 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455560 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4455321 4455372 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4455321 4455372 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454682 4455003 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454682 4455003 . - 2 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4454473 4454620 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4454473 4454620 . - 1 gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source exon 4453288 4454397 . - . gene_id "ABC_00010"; transcript_id "ABC_00010T0";
Chromosome5 file_source CDS 4453291 4454397 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";


A protein-model FASTA file is built from this GTF file, and the number of sequences in it equals the number of distinct "transcript_id" values in the attributes column. Because of alternative splicing, one "gene_id" can have more than one "transcript_id", so the counts of "gene_id" and "transcript_id" differ. I would like to parse this GTF file into a simpler GTF file like this:


Chromosome5 file_source gene 4469688 4470286 . - 0 gene_id "ABC_00005"; transcript_id "ABC_00005T0";
Chromosome5 file_source gene 4453288 4455595 . - 0 gene_id "ABC_00010"; transcript_id "ABC_00010T0";


That is, the feature (column 3) is set to gene, and columns 4 and 5 give the gene start and end sites, respectively, for each "transcript_id". It seems easy, but some genes do not have "start_codon" or "stop_codon" features.

Does anyone know of a GTF parser that can build such a gene annotation file using only the "start_codon", "stop_codon", "CDS" and "exon" information for each "transcript_id"? Please let me know.

 

Tags: genome, gff, gtf, gene
Ryan Dale (Bethesda, MD) wrote, 5.5 years ago:

Inferring gene extent from GTF files can be done with gffutils (github, docs).

The docs for importing GTF files have some more detail for handling more difficult cases, but your example file looks straightforward. The following gist shows how to write a new file containing the inferred genes:
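
Separately from the gist and from gffutils, the underlying idea (take the smallest start and the largest end over the features that share a "transcript_id") can be sketched in plain Python. This is only an illustrative stand-in, not the gist's code and not the gffutils API. The function name `infer_gene_lines` and its `feature_types` parameter are made up for this example, and it assumes tab-separated input in which every feature of a transcript lies on the same sequence and strand.

```python
import re

def infer_gene_lines(gtf_lines, feature_types=("exon", "CDS", "start_codon", "stop_codon")):
    """Collapse GTF features into one 'gene' line per transcript_id (hypothetical helper)."""
    spans = {}  # transcript_id -> [seqname, source, start, end, strand, attributes]
    for line in gtf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue  # skip malformed lines and unwanted feature types
        seqname, source, _, start, end, _, strand, _, attrs = fields
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match is None:
            continue  # no transcript_id to group on
        tid = match.group(1)
        start, end = int(start), int(end)
        if tid not in spans:
            spans[tid] = [seqname, source, start, end, strand, attrs]
        else:
            rec = spans[tid]
            rec[2] = min(rec[2], start)  # leftmost coordinate seen so far
            rec[3] = max(rec[3], end)    # rightmost coordinate seen so far
    # Score and frame are written as "." because neither applies to a whole gene.
    return [
        "\t".join([seqname, source, "gene", str(start), str(end), ".", strand, ".", attrs])
        for seqname, source, start, end, strand, attrs in spans.values()
    ]

# The ABC_00005T0 features from the question, rebuilt as tab-separated lines.
ATTRS = 'gene_id "ABC_00005"; transcript_id "ABC_00005T0";'
rows = [
    ("start_codon", 4470284, 4470286),
    ("stop_codon", 4469688, 4469690),
    ("exon", 4470173, 4470286),
    ("CDS", 4470173, 4470286),
    ("exon", 4470034, 4470120),
    ("CDS", 4470034, 4470120),
    ("exon", 4469273, 4469969),
    ("CDS", 4469691, 4469969),
]
sample = [
    "\t".join(["Chromosome5", "file_source", feat, str(s), str(e), ".", "-", ".", ATTRS])
    for feat, s, e in rows
]

for gene_line in infer_gene_lines(sample):
    print(gene_line)
```

With all four feature types the extent for ABC_00005T0 comes out as 4469273 to 4470286, because the last exon extends past the stop codon into the untranslated region. Passing `feature_types=("CDS", "start_codon", "stop_codon")` restricts the extent to the coding part, 4469688 to 4470286, which matches the coordinates in the question's desired output. Because the extent is taken from whatever features are present, transcripts lacking "start_codon" or "stop_codon" lines are still handled.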

 

 


Karyo replied, 5.5 years ago:

Wow! It works. Thank you, @Daler!