Parsing GTF file - Help!
6.1 years ago
espop23 ▴ 60

I have data from GENCODE that looks like this:

     chr1    ENSEMBL    gene    17369    17436    .    -    .    gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
     chr1    ENSEMBL    gene    30366    30503    .    +    .    gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
     chr1    ENSEMBL    gene    157784    157887    .    -    .    gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;


I have tried using gffutils, but I get an error with this code: 

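    import gffutils

    db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')

    # List the feature types in the database, e.g.
    # ['CDS', 'exon', 'gene', 'start_codon', 'stop_codon', 'transcript']
    print(list(db.featuretypes()))

    # Here's how to write genes out to file.
    # A separate output name is used so the input GTF is not overwritten.
    with open('sRNA.genes_only.gtf', 'w') as fout:
        for gene in db.features_of_type('gene'):
            fout.write(str(gene) + '\n')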


It fails with:

    ImportError: cannot import name 'feature'


Can someone please offer suggestions on the best way to parse such GTF files? 

If I use your example GTF file and your example code, it works -- with the exception that the list of featuretypes is ['gene'] since only gene features are in your example GTF.

Can you provide a minimal example (complete code and input) that reproduces the error?

More generally, what is your end goal? It may not be necessary to create a database. For example, you can use gffutils just for parsing a GTF file (with the gffutils.FeatureIterator class).
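
If the goal is only to read gene records, a database may be more than is needed. As a rough illustration of parsing without one, here is a minimal sketch in plain Python (it does not use gffutils; the names `parse_gtf_line` and `GTF_COLUMNS` are made up for this example). It assumes tab-separated GTF lines with nine columns and attributes of the form `key "value";` or `key value;`, as in the lines you posted:

```python
import re

# Matches one GTF attribute: key "quoted value";  or  key bare_value;
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')

# Names for the first eight GTF columns (hypothetical labels for this sketch).
GTF_COLUMNS = ("seqid", "source", "featuretype", "start", "end",
               "score", "strand", "frame")


def parse_gtf_line(line):
    """Parse one GTF record into a dict; return None for comments/blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Note: if a key is repeated (e.g. several "tag" entries), only the
    # last value is kept in this simple sketch.
    attributes = {}
    for match in _ATTR_RE.finditer(fields[8]):
        key, quoted, bare = match.groups()
        attributes[key] = quoted if quoted is not None else bare
    record["attributes"] = attributes
    return record


# Usage: build one of the example lines with real tab separators.
example = "\t".join([
    "chr1", "ENSEMBL", "gene", "17369", "17436", ".", "-", ".",
    'gene_id "ENSG00000278267.1"; gene_type "miRNA"; '
    'gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;',
])
rec = parse_gtf_line(example)
print(rec["attributes"]["gene_name"], rec["start"], rec["end"])
```

For the example line this prints `MIR6859-1 17369 17436`. To process a whole file, the same function can be applied to each line of an open file handle, skipping the `None` results.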

Lastly, see the hints in the answer to "GFFutils very slow at creating database file. Any Idea why..?" for working with GENCODE GTF files, which now already include features for genes and transcripts.

