Question

How to remove repeated sequence region in Interproscan .gff3 with many ORFs?

2.9 years ago

acastill • 0

Hi, I'm having trouble converting the .gff3 output from InterProScan to .gtf. I used AGAT, but it gave an error about repeated IDs. This online validator gives a similar error message: 'sequence region "tig00000001_377 (...) has already been defined'. Looking back at the original file, there are three instances of this sequence region, each with a distinct ORF, like so:

##gff-version 3 
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.50-84.0
##sequence-region tig00000001_377 1 2496
tig00000001_377 provided_by_user    nucleic_acid    1   2496    0   +   0   ID=tig00000001_377;Name=tig00000001_377;md5=7d26a317c817503d101bf1feadcb2f93
tig00000001_377_orf11327    getorf  ORF 2042    2347    0   -   0   Target=pep_tig00000001_377_2042_2347_r 1 102;ID=orf_tig00000001_377_2042_2347_r;Name=tig00000001_377_orf11327;md5=7d26a317c817503d101bf1feadcb2f93
##sequence-region tig00000001_377 1 2496
tig00000001_377 provided_by_user    nucleic_acid    1   2496    0   +   0   ID=tig00000001_377;Name=tig00000001_377;md5=7d26a317c817503d101bf1feadcb2f93
tig00000001_377_orf11359    getorf  ORF 3   266 0   -   0   Target=pep_tig00000001_377_3_266_r 1 88;ID=orf_tig00000001_377_3_266_r;Name=tig00000001_377_orf11359;md5=7d26a317c817503d101bf1feadcb2f93

Each also has polypeptide and protein_match features. I've never done this before, so I'm not sure how to proceed. Can the ORFs somehow be combined? Also, the third instance of this sequence region has many protein matches, mostly similar ones ('Ribonuclease' or 'Rnase').

Additionally, I tried gff3tools to 'fix' my gff3, but since the ORF is included in the ID column, the IDs in the .gff3 did not match the contig names in the original .fna file. Do I have to change the ID column too?

Lastly, this is actually just a small subset of the data. I wanted to run InterProScan on a metagenome, but due to its large size I had to break it up into ~500 smaller files (using the command recommended here). So unfortunately I can't manually check what is happening with repeated sequences in each file.

In short: how can I make this into a valid .gff3, so that it can be converted to a .gtf? Any help will be greatly appreciated!

format region interproscan gff3 gtf • 1.1k views

ADD COMMENT • link updated 2.8 years ago by Juke34 8.5k • written 2.9 years ago by acastill • 0

What agat commands are you using? Are you trying this?

ADD REPLY • link 2.9 years ago by Arsenal ▴ 160

I had used agat_convert_sp_gff2gtf.pl. After your comment I tried agat_sp_keep_longest_isoform.pl, but the resulting file only contained the original file headers, plus the header of one sequence region. The script also gave the following warning messages:

INFO - Feature types not expected by the GFF3 specification:

nucleic_acid
The feature type (3rd column in GFF3) is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:0000110) or an is_a child of it.

and

WARNING - Feature types not expected by AGAT:
nucleic_acid
orf
The feature of these types (3rd column in GFF3) are skipped by the parser! To take them into account you must update the feature json files.
agat_convert_sp_gxf2gxf.pl --expose 
In which file to add my feature?
Feature level1 (e.g. gene, match, region): My feature has no parent => features_level1.json

Feature level2 (e.g. mrna, match_part, trna): My feature has one parent and children => features_level2.json.

Feature level3 (e.g. exon, intron, cds): My feature has one parent (the parent has also a parent) and no children => features_level3.json.

Feature level3 discontinuous (e.g. cds, utr): A single feature that exists over multiple genomic locations => features_spread.json.

So I took a look at the Sequence Ontology terms and found that ORF is actually a term in it, but nothing I saw exactly matched 'nucleic acid'. Then I tried updating the json files; for level 3 features, the file looks like this:

{
"_comment": "level3 features have parents, but no child. ( cds => exon mean that cds is included into an exon)",
"cds":"exon",
"exon":"1",
"five_prime_utr":"exon",
"intron":"1",
"non_canonical_five_prime_splice_site":"1",
"non_canonical_three_prime_splice_site":"1",
"selenocysteine":"1",
"sig_peptide":"exon",
"start_codon":"exon",
"stop_codon":"exon",
"stop_codon_read_through":"exon",
"three_prime_utr":"exon",
"tss":"exon",
"transcription_end_site":"exon",
"tts":"exon",
"3utr":"exon",
"utr":"exon",
"5utr":"exon"
}

Would it make sense to add "ORF":"exon" and "nucleic_acid":"exon"?

ADD REPLY • link 2.9 years ago by acastill • 0

If your file contains only ORF and nucleic_acid features and they are independent of each other, the way to go is to add them to features_level1.json like this:

"ORF":"standalone",
"nucleic_acid":"standalone",

They will then be processed properly, i.e. duplicated nucleic_acid features will be removed and the ORFs will get unique identifiers.
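
If you only want to get rid of the "has already been defined" complaint from a validator before handing the file to another tool, here is a minimal Python sketch that does this outside of AGAT. It is not part of AGAT or InterProScan; the function name dedupe_gff3 and the file names are made up for illustration. It assumes that a repeated ##sequence-region pragma for the same sequence ID, and any feature line that is an exact character-for-character repeat, can safely be dropped, and it passes everything after a ##FASTA line through untouched. It does not rename IDs or change the feature types, so the json changes above are still needed for AGAT itself.

```python
import sys


def dedupe_gff3(lines):
    """Yield GFF3 lines, skipping repeated ##sequence-region pragmas
    and exact duplicate feature lines (assumption: exact repeats are redundant)."""
    seen_regions = set()   # sequence IDs already declared by a pragma
    seen_features = set()  # full text of feature lines already emitted
    in_fasta = False
    for raw in lines:
        line = raw.rstrip("\n")
        if in_fasta:
            # Sequence data: never deduplicate, identical lines can be legitimate.
            yield line
            continue
        if line.startswith("##FASTA"):
            in_fasta = True
        elif line.startswith("##sequence-region"):
            parts = line.split()
            seqid = parts[1] if len(parts) > 1 else line
            if seqid in seen_regions:
                continue
            seen_regions.add(seqid)
        elif line and not line.startswith("#"):
            if line in seen_features:
                continue
            seen_features.add(line)
        yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python dedupe_gff3.py input.gff3 output.gff3
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for out_line in dedupe_gff3(src):
            dst.write(out_line + "\n")
```

Because it works line by line, the same function can be run in a loop over each of the ~500 split files. Blank lines and other comment lines are kept as they are.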

ADD REPLY • link 2.8 years ago by Juke34 8.5k