Processing GFF file for feature extraction
7 months ago

Hello Everyone!!

I have run into a problem that I think is not so trivial, and that most of you who have handled the GFF file format have probably faced. I have a GFF file for a scaffold-level genome assembly of a strain, and I need to separate its features into sets such as exons, genes, pseudogenes, tRNA, ncRNA, etc.

However, there is a teeny weeny problem:

First I decided to use the third column of the GFF file to separate these features. However, it turns out that in several places an ncRNA region is designated as exon in the third column, while the attribute gbkey=ncRNA says it is an ncRNA region, as shown below. Should such lines be included in the ncRNA set? Similarly, pseudogenes are assigned gbkey=Gene.

NZ_QVHU01000011.1 cmsearch exon 44029 44136 . + . ID=exon-DW209_RS16640-1;Parent=rna-DW209_RS16640;Dbxref=RFAM:RF00034;gbkey=ncRNA;gene=rprA;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS16640;product=antisense sRNA RprA

NZ_QVHU01000004.1 RefSeq gene 149449 149632 . + . ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene=ssrS;gene_biotype=ncRNA;locus_tag=DW209_RS09040;old_locus_tag=DW209_09045

NZ_QVHU01000004.1 cmsearch ncRNA 149449 149632 . + . ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

NZ_QVHU01000004.1 cmsearch exon 149449 149632 . + . ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

Also, sometimes the same region is defined as a gene, an exon, and an ncRNA or tRNA. Would it be appropriate to include it in all three categories? I have read about the GFF3 format here and here.

7 months ago
Juke34 ★ 5.5k

Those 3 features go together and represent a correct record:

NZ_QVHU01000004.1 RefSeq gene 149449 149632 . + . ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene=ssrS;gene_biotype=ncRNA;locus_tag=DW209_RS09040;old_locus_tag=DW209_09045

NZ_QVHU01000004.1 cmsearch ncRNA 149449 149632 . + . ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

NZ_QVHU01000004.1 cmsearch exon 149449 149632 . + . ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA


Just as an mRNA has a parent gene and children (exon, UTR, CDS), here an ncRNA has a gene as its parent and an exon as its child. Using agat_sp_separate_by_record_type.pl from AGAT you can separate the file by record type, e.g. all ncRNAs, together with their parent and child features, will end up in a dedicated file.
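To make the parent/child idea concrete, here is a minimal Python sketch of grouping GFF3 lines into records by following the ID and Parent attributes. It is not how AGAT is implemented, only an illustration of the principle. The function names (parse_attributes, group_by_record_type) are hypothetical, and the sketch assumes tab-delimited lines with nine columns, at most one Parent per feature (only the first is used if several are listed), and that the "record type" is the type of the feature directly below the top-level one (e.g. ncRNA below gene), falling back to the top-level type when it has no children.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def group_by_record_type(lines):
    """Return {record_type: [gff lines]} with parents and children kept together."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        parent = attrs.get("Parent")
        features.append({
            "line": line,
            "type": cols[2],
            "id": attrs.get("ID"),
            # Assumption: keep only the first parent if several are listed.
            "parent": parent.split(",")[0] if parent else None,
        })

    by_id = {f["id"]: f for f in features if f["id"]}

    def root_of(feature):
        """Climb the Parent chain up to the top-level feature."""
        seen = set()
        while feature["parent"] in by_id and feature["id"] not in seen:
            seen.add(feature["id"])  # guards against circular Parent links
            feature = by_id[feature["parent"]]
        return feature

    # One record = a top-level feature plus everything that descends from it.
    records = {}
    for f in features:
        root = root_of(f)
        records.setdefault(root["line"], (root, []))[1].append(f)

    grouped = defaultdict(list)
    for root, members in records.values():
        level2 = [m["type"] for m in members
                  if m is not root and m["parent"] == root["id"]]
        record_type = level2[0] if level2 else root["type"]
        grouped[record_type].extend(m["line"] for m in members)
    return dict(grouped)
```

With the three ssrS lines above (gene, ncRNA, exon), all three would land under the key "ncRNA", which answers the original question: the exon line belongs to the ncRNA record rather than to a separate "exon" set. Each group can then be written to its own file.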


Thanks, @Juke34. I will check this out and let you know if it works for me.