Question

How to convert gff3 format, PASApipeline

3.2 years ago

Ruixuan • 0

Hi all,

I'm annotating UTRs with PASA, following https://github.com/PASApipeline/PASApipeline/wiki/PASA_genome_annotation.

This process requires a GFF3 file. The sample GFF3 it provides looks like this:

gi|68711        TIGR    gene    5662    6138    .       +       .       ID=68711.t00001;Name=my protein name
gi|68711        TIGR    mRNA    5662    6138    .       +       .       ID=model.68711.m00001;Parent=68711.t00001
gi|68711        TIGR    exon    5662    6138    .       +       .       ID=68711.e00001;Parent=model.68711.m00001
gi|68711        TIGR    CDS     5662    6138    .       +       0       ID=5662_6138cds_of_68711.m00001;Parent=model.68711.m00001

But the GFF3 file I downloaded from NCBI looks like this:

AP018495.1      DDBJ    region  1       381277  .       +       .       ID=AP018495.1:1..381277;Dbxref=taxon:2080449;gbkey=Src;isolation-source=A water/soil sample collected from the Jozankei Onsen;mol_type=genomic DNA
AP018495.1      DDBJ    CDS     261     647     .       -       0       ID=cds-BBI30141.1;Dbxref=NCBI_GP:BBI30141.1;Name=BBI30141.1;Note=ORF1;gbkey=CDS;product=hypothetical protein;protein_id=BBI30141.1
AP018495.1      DDBJ    CDS     706     1308    .       +       0       ID=cds-BBI30142.1;Dbxref=NCBI_GP:BBI30142.1;Name=BBI30142.1;Note=ORF2;gbkey=CDS;product=putative HD hydrolase;protein_id=BBI30142.1

As you can see, my file contains only "CDS" features, whereas the sample GFF3 has "gene", "mRNA", "exon", and "CDS" features. How can I convert my file into the required format?

Thanks in advance

RNA-Seq Assembly • 763 views

Cross-post at reddit: https://www.reddit.com/r/bioinformatics/comments/lpjmp5/how_to_convert_gff3_format_pasapipeline/

3.2 years ago by Ruixuan • 0

score 2 · Accepted Answer · 2021-02-22

3.2 years ago

Juke34 8.5k

Using AGAT:

agat_convert_sp_gxf2gxf.pl --gff input.gff --ct protein_id -o standardized_file.gff

In this example, the value of the protein_id attribute is used to group the CDS features that belong to the same mRNA. If a gene/locus has several isoforms, each isoform will get its own gene parent (apparently there is no way to tell from your file whether there are isoforms). Adding --merge_loci will merge mRNAs whose CDS parts overlap under the same parent gene.
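To make the grouping idea concrete, here is a minimal Python sketch of what "collect CDS by protein_id and build the missing parents" means. It is not AGAT's implementation, just an illustration under a few assumptions: the input is tab-separated GFF3, each exon is set equal to its CDS (the file carries no UTR information), and the function name `add_parents` and the `gene-`/`mrna-`/`exon-` ID prefixes are made up for this example.

```python
from collections import OrderedDict


def add_parents(gff_lines, tag="protein_id"):
    """Group CDS rows by `tag` and emit gene/mRNA/exon/CDS rows per group."""
    groups = OrderedDict()  # tag value -> list of CDS rows (as column lists)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS features are used in this sketch
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if tag in attrs:
            groups.setdefault(attrs[tag], []).append(cols)

    out = []
    for key, cds_rows in groups.items():
        seqid, source, strand = cds_rows[0][0], cds_rows[0][1], cds_rows[0][6]
        start = min(int(c[3]) for c in cds_rows)
        end = max(int(c[4]) for c in cds_rows)
        gene_id, mrna_id = f"gene-{key}", f"mrna-{key}"  # hypothetical ID scheme

        def row(ftype, s, e, phase, attributes):
            return "\t".join(
                [seqid, source, ftype, str(s), str(e), ".", strand, phase, attributes]
            )

        # One gene and one mRNA spanning all CDS segments of this group
        out.append(row("gene", start, end, ".", f"ID={gene_id}"))
        out.append(row("mRNA", start, end, ".", f"ID={mrna_id};Parent={gene_id}"))
        # Assumption: with no UTR information, each exon equals its CDS segment
        for i, c in enumerate(sorted(cds_rows, key=lambda c: int(c[3])), 1):
            out.append(row("exon", c[3], c[4], ".", f"ID=exon-{key}-{i};Parent={mrna_id}"))
            out.append(row("CDS", c[3], c[4], c[7], f"ID=cds-{key};Parent={mrna_id}"))
    return out


# Two CDS rows modelled on the NCBI file in the question (attributes shortened)
example = [
    "\t".join(["AP018495.1", "DDBJ", "CDS", "261", "647", ".", "-", "0",
               "ID=cds-BBI30141.1;protein_id=BBI30141.1"]),
    "\t".join(["AP018495.1", "DDBJ", "CDS", "706", "1308", ".", "+", "0",
               "ID=cds-BBI30142.1;protein_id=BBI30142.1"]),
]
result = add_parents(example)
print("\n".join(result))
```

Each protein_id ends up with its own gene, mRNA, exon, and CDS rows, which mirrors the layout of the PASA sample. For real data the AGAT command above is the better choice, since it also handles cases this sketch ignores.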