I have a GFF3 file produced by an analysis step (specifically InterPro, but that's not terribly relevant here). Since that tool took a FASTA file of proteins as input, all of the analysis results have coordinates relative to the analysed protein sequences.
I'm trying to get these results to show up in a browser like JBrowse. The easiest way I found of doing this was to rebase the coordinates against the parent genome. E.g. if there was a `match_part` from 1-100 of cdsA, which consists of bases 200..300, then we'd update the `match_part` to be 200..299 and change the parent reference to the parent genome.
I have a small tool that does this, but was wondering if anyone has a better solution (I just want to display them properly in JBrowse), or knows of a fully featured existing implementation of a rebasing tool like this?
I have a gff file with my gene calls, like so:
    ##gff-version 3
    ##sequence-region Merlin 1 172788
    Merlin	GeneMark.hmm	gene	2	691	-856.563659	+	.	ID=Merlin_1
    Merlin	GeneMark.hmm	gene	1067	2011	-1229.683915	-	.	ID=Merlin_3
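To rebase anything, the gene coordinates first have to be looked up by ID. Here's a minimal sketch of that lookup in plain Python, assuming tab-separated GFF3 with a simple `ID=` attribute. The names `Gene` and `load_genes` are made up for illustration and aren't from any existing tool.

```python
from typing import Dict, Iterable, NamedTuple


class Gene(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def load_genes(gff_lines: Iterable[str]) -> Dict[str, Gene]:
    """Index the 'gene' features of a GFF3 file by their ID attribute."""
    genes: Dict[str, Gene] = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, pragmas (##...) and comments (#...).
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if "ID" in attrs:
            genes[attrs["ID"]] = Gene(cols[0], int(cols[3]), int(cols[4]), cols[6])
    return genes


example = [
    "##gff-version 3",
    "##sequence-region Merlin 1 172788",
    "Merlin\tGeneMark.hmm\tgene\t2\t691\t-856.563659\t+\t.\tID=Merlin_1",
    "Merlin\tGeneMark.hmm\tgene\t1067\t2011\t-1229.683915\t-\t.\tID=Merlin_3",
]
genes = load_genes(example)
```

With the two gene calls above, `genes["Merlin_3"]` holds the seqid `Merlin`, the span 1067..2011 and the minus strand.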
From this, the gene sequences were extracted, translated to protein sequences, and then run through an analysis step which generated some results/matches. In this case they're InterProScan results:
    Merlin	feature	polypeptide	1	229	.	+	.	ID=Merlin_1
    Merlin	Gene3D	protein_match	2	50	2.9E-21	+	.	ID=match%2477_2_50;Name=G3DSA:22.214.171.124;Target=Merlin_1 2 50;date=23-02-2015;status=T
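Each match has to be tied back to the gene it came from. In the output above that link appears to be the first token of the `Target` attribute, so this rough sketch assumes that. The helper names `parse_attributes` and `target_id` are hypothetical. GFF3 percent-encodes reserved characters in column 9, hence the `unquote` call.

```python
from typing import Dict, Optional
from urllib.parse import unquote


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attributes column into a dict, decoding %XX escapes."""
    attrs: Dict[str, str] = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[unquote(key)] = unquote(value)
    return attrs


def target_id(attrs: Dict[str, str]) -> Optional[str]:
    """Return the sequence ID from a 'Target=<id> <start> <end>' attribute."""
    target = attrs.get("Target")
    return target.split()[0] if target else None


attrs = parse_attributes(
    "ID=match%2477_2_50;Name=G3DSA:22.214.171.124;"
    "Target=Merlin_1 2 50;date=23-02-2015;status=T"
)
parent_gene = target_id(attrs)
```

Here `parent_gene` comes out as `Merlin_1`, which is the key to look up in the gene calls.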
For these results to display properly in JBrowse, their coordinates need to be adjusted so that they are relative to the parent genome.
The feature with ID=Merlin_1 should be moved 1 base to the right, as the gene that was analysed to produce that match starts at base 2.
A hit with ID=Merlin_3, going from bases 1..11 (in the InterPro results), would need to be changed to the minus strand and moved to 2001..2011 according to the parent genome.
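The two cases above can be written as one small function. This is only a sketch of the arithmetic, not the tool mentioned earlier: the name `rebase` and the `scale` parameter are my own. With `scale=1` it reproduces the one-to-one shift described here, where local 1..11 on the minus-strand gene 1067..2011 works out to 2001..2011. Since InterProScan reports positions in amino acids and each amino acid spans three nucleotides, `scale=3` is probably what you'd want for real protein matches, but that is an assumption about the input rather than something shown in the data.

```python
from typing import Tuple


def rebase(local_start: int, local_end: int,
           gene_start: int, gene_end: int, gene_strand: str,
           scale: int = 1) -> Tuple[int, int, str]:
    """Map a 1-based inclusive interval on a gene onto genome coordinates.

    scale is the number of genome bases per local unit
    (1 for nucleotide offsets, 3 for amino-acid positions).
    """
    # 0-based offsets of the first and last genome base covered by the match.
    first = (local_start - 1) * scale
    last = local_end * scale - 1
    if gene_strand == "+":
        # Plus strand: count forwards from the gene start.
        return gene_start + first, gene_start + last, "+"
    if gene_strand == "-":
        # Minus strand: local position 1 sits at the gene's end coordinate,
        # so count backwards and swap so that start <= end.
        return gene_end - last, gene_end - first, "-"
    raise ValueError(f"unsupported strand: {gene_strand!r}")


plus_hit = rebase(2, 50, 2, 691, "+")        # match on Merlin_1
minus_hit = rebase(1, 11, 1067, 2011, "-")   # match on Merlin_3
protein_hit = rebase(2, 50, 2, 691, "+", scale=3)
```

A fuller version would also rewrite column 1 and the strand column of each GFF line and check that the rebased end doesn't run past the gene.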