Question: How to fix my broken GTF file?
0
gravatar for SmallChess
19 months ago by
SmallChess460
Australia
SmallChess460 wrote:

I have an incomplete GTF file with lines such as:

chr1    hg38_ct_UserTrack_3545  exon    94353   94355   2109    +   .   gene_id "R2_66"; transcript_id "R2_66_1";

This describes an exon. All the lines in my incomplete GTF file describe an exon or CDS.

Question:

I want to fix my GTF file. For instance, I need to do something like:

chrI   hg38_ct_UserTrack_3545  gene      6790136 6808198 .       +       .       gene_id "R1_102";

and

chrI   hg38_ct_UserTrack_3545  transcript      6790136 6808198 .       +       .       transcript_id "R1_102";

I would like to add annotation at the transcript and gene level. What's the best way?

gtf • 694 views
ADD COMMENTlink modified 19 months ago by Ryan Dale4.8k • written 19 months ago by SmallChess460

I think you can get gene region on genome by your gtf file. Ignore UTR region. You can try to get the start and end position of a gene. The start and end position should in the first and last exon's boundary, these exons should belong to the same gene.

ADD REPLYlink written 19 months ago by l0o0160
2
gravatar for Ryan Dale
19 months ago by
Ryan Dale4.8k
Bethesda, MD
Ryan Dale4.8k wrote:

The gffutils python package can help with this: https://daler.github.io/gffutils

It will infer the gene and transcript extents; doing so is the default behavior for GTF files. The following will import the GTF into a sqlite3 database (which you can use for other things if needed) and then output a new file with everything including gene and transcript features:

import gffutils
db = gffutils.create_db('my.gtf', 'my.gtf.db')
with open('fixed.gtf', 'w') as fout:
    for feature in db.all_features():
        fout.write(str(feature) + '\n')

I should point out that what you call an "incomplete" file is actually an on-spec GTF file -- the specification says not to include gene or transcript features.

ADD COMMENTlink written 19 months ago by Ryan Dale4.8k
1
gravatar for gandrescabrera
19 months ago by
gandrescabrera80 wrote:

Have you tried BioMart @ Ensembl? You can DL the GTF file with the annotations from this http://www.ensembl.org/info/data/ftp/index.html

Then you could load this to R, if you use, or make a AWK with Bash: for example

awk -F"\t" '$3 ~ /exon|CDS|start_codon|stop_codon|5UTR|3UTR|inter|inter_CNS/{print}' data > data3

Where -F = field separator, in this case a tab $3 = the field you want to look in /pattern/ = the pattern to search {print} = action to take data = file to look pattern at data3 = output file

hope it works fine!

ADD COMMENTlink written 19 months ago by gandrescabrera80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 820 users visited in the last hour