How to fix my broken GTF file?
2
0
Entering edit mode
5.9 years ago
SmallChess ▴ 620

I have an incomplete GTF file with lines such as:

chr1    hg38_ct_UserTrack_3545  exon    94353   94355   2109    +   .   gene_id "R2_66"; transcript_id "R2_66_1";


This describes an exon. All the lines in my incomplete GTF file describe an exon or CDS.

Question:

I want to fix my GTF file. For instance, I need to do something like:

chrI   hg38_ct_UserTrack_3545  gene      6790136 6808198 .       +       .       gene_id "R1_102";


and

chrI   hg38_ct_UserTrack_3545  transcript      6790136 6808198 .       +       .       transcript_id "R1_102";


I would like to add annotation at the transcript and gene level. What's the best way?

gtf • 2.9k views
0
Entering edit mode

I think you can get gene region on genome by your gtf file. Ignore UTR region. You can try to get the start and end position of a gene. The start and end position should in the first and last exon's boundary, these exons should belong to the same gene.

2
Entering edit mode
5.9 years ago
Ryan Dale 5.0k

The gffutils python package can help with this: https://daler.github.io/gffutils

It will infer the gene and transcript extents; doing so is the default behavior for GTF files. The following will import the GTF into a sqlite3 database (which you can use for other things if needed) and then output a new file with everything including gene and transcript features:

import gffutils
db = gffutils.create_db('my.gtf', 'my.gtf.db')
with open('fixed.gtf', 'w') as fout:
for feature in db.all_features():
fout.write(str(feature) + '\n')


I should point out that what you call an "incomplete" file is actually an on-spec GTF file -- the specification says not to include gene or transcript features.

1
Entering edit mode
5.9 years ago

Have you tried BioMart @ Ensembl? You can DL the GTF file with the annotations from this http://www.ensembl.org/info/data/ftp/index.html

Then you could load this to R, if you use, or make a AWK with Bash: for example

awk -F"\t" '$3 ~ /exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS/{print}' data > data3  Where -F = field separator, in this case a tab$3 = the field you want to look in /pattern/ = the pattern to search {print} = action to take data = file to look pattern at data3 = output file

hope it works fine!