Hello Everyone!!
I have run into a problem that I think is not so trivial, and that most of you who have handled the GFF file format have probably faced. I have a GFF file for a scaffold-level genome assembly of a strain, and I need to separate its features into sets such as exons, genes, pseudogenes, tRNA, ncRNA, etc.
However, there is a teeny weeny problem:
At first I decided to use the third column of the GFF file to separate these features. However, it turns out that in several places an ncRNA region is labelled as exon in the third column, while its gbkey=ncRNA attribute says it is an ncRNA, as shown below. Should such records be included in the ncRNA set? Similarly, pseudogenes are assigned gbkey=Gene.
NZ_QVHU01000011.1 cmsearch exon 44029 44136 . + . ID=exon-DW209_RS16640-1;Parent=rna-DW209_RS16640;Dbxref=RFAM:RF00034;gbkey=ncRNA;gene=rprA;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS16640;product=antisense sRNA RprA
NZ_QVHU01000004.1 RefSeq gene 149449 149632 . + . ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene=ssrS;gene_biotype=ncRNA;locus_tag=DW209_RS09040;old_locus_tag=DW209_09045
NZ_QVHU01000004.1 cmsearch ncRNA 149449 149632 . + . ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA
NZ_QVHU01000004.1 cmsearch exon 149449 149632 . + . ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA
Also, sometimes the same region is defined as a gene, an exon, and an ncRNA or tRNA. Would it be appropriate to count it in all three categories? I have read about the GFF3 format here and here.
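To make the question concrete, here is a minimal Python sketch of the kind of split I have in mind. It only groups records by the pair (column 3 type, gbkey) so that mismatches such as exon/ncRNA become visible; it does not decide which set a record belongs to. The function names and the shortened sample records are my own illustration, and it assumes the file is tab-separated with nine columns.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            # Split on the first '=' only; values may contain ':' or '='.
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def group_by_type_and_gbkey(lines):
    """Group GFF3 records by (column 3 type, gbkey attribute).

    Returns a dict mapping (type, gbkey) to a list of raw lines.
    Comment lines, blank lines and lines without 9 columns are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        groups[(cols[2], attrs.get("gbkey", "unknown"))].append(line)
    return groups


# Shortened, illustrative versions of the records shown above
# (attributes trimmed; tabs restored between the nine columns).
SAMPLE = [
    "\t".join(["NZ_QVHU01000004.1", "RefSeq", "gene", "149449", "149632",
               ".", "+", ".",
               "ID=gene-DW209_RS09040;gbkey=Gene;gene_biotype=ncRNA"]),
    "\t".join(["NZ_QVHU01000004.1", "cmsearch", "ncRNA", "149449", "149632",
               ".", "+", ".",
               "ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;gbkey=ncRNA"]),
    "\t".join(["NZ_QVHU01000004.1", "cmsearch", "exon", "149449", "149632",
               ".", "+", ".",
               "ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;gbkey=ncRNA"]),
]

groups = group_by_type_and_gbkey(SAMPLE)
for (ftype, gbkey), records in sorted(groups.items()):
    print(ftype, gbkey, len(records))
```

On a real file I would pass the open file handle instead of `SAMPLE`. The grouping makes it easy to see how often column 3 and gbkey disagree, but which of the two should define the ncRNA or pseudogene set is exactly what I am unsure about.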
Thanks, @Juke34. I will check this out and let you know if it works for me.