Question: How to fix my broken GTF file?
3.2 years ago, SmallChess wrote:

I have an incomplete GTF file with lines such as:

chr1    hg38_ct_UserTrack_3545  exon    94353   94355   2109    +   .   gene_id "R2_66"; transcript_id "R2_66_1";

This describes an exon. All the lines in my incomplete GTF file describe an exon or CDS.


I want to fix my GTF file by adding gene and transcript lines. For instance, I need something like:

chrI   hg38_ct_UserTrack_3545  gene      6790136 6808198 .       +       .       gene_id "R1_102";


chrI   hg38_ct_UserTrack_3545  transcript      6790136 6808198 .       +       .       transcript_id "R1_102";

I would like to add annotation at the transcript and gene level. What's the best way?


I think you can derive each gene's region on the genome from your GTF file itself, ignoring UTRs. The start and end positions of a gene should be the outer boundaries of its first and last exons, that is, the smallest start and the largest end among the exons that belong to the same gene.
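
The idea above can be sketched in plain Python. This is only a minimal illustration, assuming tab-separated GTF lines that each carry both gene_id and transcript_id attributes, and transcript IDs that are unique across genes; the function name infer_extents is made up for this example.

```python
import re

# Matches attribute pairs such as: gene_id "R2_66";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def infer_extents(gtf_lines):
    """Return gene and transcript GTF lines spanning their exon/CDS records."""
    # (feature type, id) -> [chrom, source, start, end, strand, attribute text]
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and comments
        chrom, source, _ftype, start, end, _score, strand, _frame, attrs = \
            line.rstrip('\n').split('\t')
        start, end = int(start), int(end)
        attr = dict(ATTR_RE.findall(attrs))
        gene_text = 'gene_id "%s";' % attr['gene_id']
        tx_text = gene_text + ' transcript_id "%s";' % attr['transcript_id']
        for key, text in ((('gene', attr['gene_id']), gene_text),
                          (('transcript', attr['transcript_id']), tx_text)):
            span = spans.setdefault(key, [chrom, source, start, end, strand, text])
            span[2] = min(span[2], start)  # smallest start seen so far
            span[3] = max(span[3], end)    # largest end seen so far
    return ['\t'.join([chrom, source, key[0], str(start), str(end),
                       '.', strand, '.', text])
            for key, (chrom, source, start, end, strand, text) in spans.items()]
```

Calling infer_extents(open('my.gtf')) would return the new gene and transcript lines, which can then be written out together with the original exon and CDS lines. The sketch does not sort the output or check that all records of a gene sit on the same chromosome and strand.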

Comment written 3.2 years ago by l0o0
3.2 years ago, Ryan Dale (Bethesda, MD) wrote:

The gffutils Python package can help with this.

It will infer the gene and transcript extents; doing so is the default behavior for GTF files. The following will import the GTF into a sqlite3 database (which you can use for other things if needed) and then output a new file with everything including gene and transcript features:

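import gffutils
# Build a SQLite database from the GTF. For GTF input, gffutils infers gene and
# transcript extents by default. Pass force=True to overwrite an existing database.
db = gffutils.create_db('my.gtf', 'my.gtf.db')
# Write every feature, including the inferred genes and transcripts, to a new file.
with open('fixed.gtf', 'w') as fout:
    for feature in db.all_features():
        fout.write(str(feature) + '\n')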

I should point out that what you call an "incomplete" file is actually an on-spec GTF file -- the specification says not to include gene or transcript features.

3.2 years ago, gandrescabrera wrote:

Have you tried BioMart at Ensembl? You can download a GTF file with the annotations from it.

Then you could load it into R, if you use R, or filter it with awk in Bash, for example:

awk -F"\t" '$3 ~ /exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS/{print}' data > data3

Where:
-F = field separator, in this case a tab
$3 = the field you want to look in
/pattern/ = the pattern to search for
{print} = the action to take
data = the file to search
data3 = the output file
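
For anyone who prefers Python to awk, roughly the same filter could look like the sketch below. It assumes tab-separated input; the name filter_features is made up for this example, and the pattern is copied from the awk command.

```python
import re

# Same alternation as the awk pattern. Like awk's ~ operator, re.search
# matches anywhere inside the field.
FEATURE_RE = re.compile(
    r'exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS')


def filter_features(lines, pattern=FEATURE_RE):
    """Yield the lines whose third tab-separated field matches the pattern."""
    for line in lines:
        fields = line.split('\t')
        # Lines with fewer than three fields (e.g. comments) are skipped.
        if len(fields) > 2 and pattern.search(fields[2]):
            yield line
```

To mirror the awk call, open data for reading and data3 for writing, then pass the input file to filter_features and write each yielded line to the output file.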

Hope it works fine!
