Question: Processing GFF file for feature extraction
7 weeks ago, rohitsatyam102 wrote:

Hello Everyone!!

I have run into a not-so-trivial problem that most of you who have handled the GFF file format have probably faced. I have a GFF file of a scaffold-level genome assembly of a strain, and I need to separate its features (exons, genes, pseudogenes, tRNA, ncRNA, etc.) into individual sets.

However, there is a teeny weeny problem:

First, I decided to use the third column of the GFF file to separate these features. However, it turns out that in several places an ncRNA is designated as exon in the third column while gbkey=ncRNA says it is an ncRNA-coding region, as shown below. Should such a line be included in the ncRNA set? Besides, pseudogenes are also assigned gbkey=Gene.

NZ_QVHU01000011.1 cmsearch exon 44029 44136 . + . ID=exon-DW209_RS16640-1;Parent=rna-DW209_RS16640;Dbxref=RFAM:RF00034;gbkey=ncRNA;gene=rprA;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS16640;product=antisense sRNA RprA

NZ_QVHU01000004.1 RefSeq gene 149449 149632 . + . ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene=ssrS;gene_biotype=ncRNA;locus_tag=DW209_RS09040;old_locus_tag=DW209_09045

NZ_QVHU01000004.1 cmsearch ncRNA 149449 149632 . + . ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

NZ_QVHU01000004.1 cmsearch exon 149449 149632 . + . ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

Also, sometimes the same region is defined as a gene, an exon, and an ncRNA or tRNA. Would it be appropriate to include it in all three categories? I have read about the GFF3 format here and here.
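To see how often the third column and gbkey disagree, one option is to tally every (type, gbkey) pair in the file. The following is a minimal sketch, assuming a tab-separated GFF3 file with nine columns; the helper names `parse_attributes` and `tally_type_vs_gbkey` are made up for illustration, and the sample lines are abbreviated versions of the records quoted above.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn the 'key=value;key=value' string of GFF3 column 9 into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tally_type_vs_gbkey(lines):
    """Count how often each (column-3 type, gbkey) combination occurs."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        counts[(cols[2], attrs.get("gbkey", "NA"))] += 1
    return counts


# Abbreviated copies of the lines from the post (attributes shortened).
sample = [
    "\t".join(["NZ_QVHU01000011.1", "cmsearch", "exon", "44029", "44136", ".", "+", ".",
               "ID=exon-DW209_RS16640-1;Parent=rna-DW209_RS16640;gbkey=ncRNA;gene=rprA"]),
    "\t".join(["NZ_QVHU01000004.1", "RefSeq", "gene", "149449", "149632", ".", "+", ".",
               "ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene_biotype=ncRNA"]),
    "\t".join(["NZ_QVHU01000004.1", "cmsearch", "ncRNA", "149449", "149632", ".", "+", ".",
               "ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;gbkey=ncRNA;gene=ssrS"]),
    "\t".join(["NZ_QVHU01000004.1", "cmsearch", "exon", "149449", "149632", ".", "+", ".",
               "ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;gbkey=ncRNA;gene=ssrS"]),
]

counts = tally_type_vs_gbkey(sample)
for (ftype, gbkey), n in sorted(counts.items()):
    print(ftype, gbkey, n, sep="\t")
```

For a real file, passing an open file handle instead of `sample` should work the same way, since the function only iterates over lines.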

7 weeks ago, Juke34 (Sweden) wrote:

These three features belong together and form one correct record:

NZ_QVHU01000004.1 RefSeq gene 149449 149632 . + . ID=gene-DW209_RS09040;Name=ssrS;gbkey=Gene;gene=ssrS;gene_biotype=ncRNA;locus_tag=DW209_RS09040;old_locus_tag=DW209_09045

NZ_QVHU01000004.1 cmsearch ncRNA 149449 149632 . + . ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

NZ_QVHU01000004.1 cmsearch exon 149449 149632 . + . ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;Dbxref=RFAM:RF00013;gbkey=ncRNA;gene=ssrS;inference=COORDINATES: profile:INFERNAL:1.1.1;locus_tag=DW209_RS09040;product=6S RNA

Just as an mRNA has a parent gene and children (exon, UTR, CDS), here the ncRNA has a gene as its parent and an exon as its child. Using agat_sp_separate_by_record_type.pl from AGAT, you can separate the file by record type. E.g. all ncRNAs, together with their parents and children, will end up in a dedicated file.
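The idea of a record (a top-level feature plus everything linked to it through Parent) can be sketched in a few lines of Python. This is only an illustration of the concept, not AGAT's implementation: the function name `split_by_record_type` is hypothetical, only the first Parent is followed when several are listed, and a record is labelled by the type of the first feature sitting directly under its top-level feature (or by the top-level type itself when it has no children).

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn the 'key=value;key=value' string of GFF3 column 9 into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def split_by_record_type(lines):
    """Group GFF3 feature lines into records and bucket them by record type."""
    parent_of, type_of, rows = {}, {}, []
    for n, line in enumerate(lines):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        key = attrs.get("ID", f"_line{n}")  # features without ID get a placeholder key
        parent_of[key] = attrs.get("Parent", "").split(",")[0] or None
        type_of[key] = cols[2]
        rows.append((key, line))

    def root_of(key):
        # Walk up the Parent chain until the parent is missing or unknown.
        seen = set()
        while parent_of.get(key) in parent_of and key not in seen:
            seen.add(key)
            key = parent_of[key]
        return key

    # Label each record by the type of its first direct child, if any.
    record_type = {}
    for key, _ in rows:
        root = root_of(key)
        if key != root and parent_of[key] == root:
            record_type.setdefault(root, type_of[key])

    by_type = defaultdict(list)
    for key, line in rows:
        root = root_of(key)
        by_type[record_type.get(root, type_of[root])].append(line)
    return dict(by_type)


def gff_line(seqid, source, ftype, start, end, attributes):
    """Build a tab-separated GFF3 line for the demo (plus strand, no score/phase)."""
    return "\t".join([seqid, source, ftype, str(start), str(end), ".", "+", ".", attributes])


# Abbreviated copies of the lines from the thread (attributes shortened).
sample = [
    gff_line("NZ_QVHU01000004.1", "RefSeq", "gene", 149449, 149632,
             "ID=gene-DW209_RS09040;gbkey=Gene;gene_biotype=ncRNA"),
    gff_line("NZ_QVHU01000004.1", "cmsearch", "ncRNA", 149449, 149632,
             "ID=rna-DW209_RS09040;Parent=gene-DW209_RS09040;gbkey=ncRNA"),
    gff_line("NZ_QVHU01000004.1", "cmsearch", "exon", 149449, 149632,
             "ID=exon-DW209_RS09040-1;Parent=rna-DW209_RS09040;gbkey=ncRNA"),
    # Parent of this exon is not in the sample, so it forms a record on its own.
    gff_line("NZ_QVHU01000011.1", "cmsearch", "exon", 44029, 44136,
             "ID=exon-DW209_RS16640-1;Parent=rna-DW209_RS16640;gbkey=ncRNA"),
]

result = split_by_record_type(sample)
for rtype, record_lines in sorted(result.items()):
    print(rtype, len(record_lines))
```

With this grouping, the gene, ncRNA and exon lines of ssrS land in the same ncRNA bucket, so the question of whether the exon line "counts" as ncRNA does not arise: it simply travels with its parents.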


Thanks, @Juke34. I will check this out and let you know if it worked for me.

Reply written 7 weeks ago by rohitsatyam102