Question: Rebase analysed GFF3 against parent data
Asked 4.2 years ago by rasche.eric (United States):

I have a GFF3 file produced by an analysis step (specifically InterPro, but that's not terribly relevant here). Since that tool took a FASTA file of proteins, all of the analysis results have coordinates relative to the analysed protein sequences.

I'm trying to get these results to show up in a browser like JBrowse. The easiest way of doing this that I found was to rebase the coordinates against the parent genome. E.g. if there was a `match_part` from 1..100 of cdsA, which spans bases 200..299 of the genome, then we'd update the `match_part` to be 200..299 and change the parent reference to the parent genome.

I have a small tool that does this, but was wondering if anyone has a better solution (I just want to display them properly in JBrowse), or knows of a fully featured existing implementation of a rebasing tool like this?

Example

I have a gff file with my gene calls, like so:

##gff-version 3
##sequence-region Merlin 1 172788
Merlin	GeneMark.hmm	gene	2	691	-856.563659	+	.	ID=Merlin_1
Merlin	GeneMark.hmm	gene	1067	2011	-1229.683915	-	.	ID=Merlin_3

From this, the gene sequences were extracted, translated to protein sequences, and then run through an analysis step which generated some results/matches. In this case they're InterProScan results:

Merlin	feature	polypeptide	1	229	.	+	.	ID=Merlin_1
Merlin	Gene3D	protein_match	2	50	2.9E-21	+	.	ID=match%2477_2_50;Name=G3DSA:3.90.176.10;Target=Merlin_1 2 50;date=23-02-2015;status=T

For these results to display properly in JBrowse, their coordinates need to be adjusted so that they are relative to the parent genome.

The feature with ID=Merlin_1 should be moved 1 base to the right, since the gene that was analysed to produce that match starts at base 2.

A hit on Merlin_3 going from 1..11 (in the InterPro results) would need to be changed to the minus strand and moved to 2001..2011 on the parent genome, since that gene ends at base 2011.
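
The arithmetic behind these two cases can be sketched in a few lines of Python. This is only an illustration, not the tool mentioned above: the function name `rebase` and the `scale` parameter are made up for this sketch, and coordinates are assumed to be 1-based and inclusive as in GFF3. With the default `scale=1` the hit is treated as a plain offset in the same units as the genome; if the hits are in amino-acid units, `scale=3` would be the natural assumption. Under a plain offset, an 11-unit hit on a minus-strand gene ending at base 2011 starts at base 2001.

```python
def rebase(child_start, child_end, parent_start, parent_end, parent_strand, scale=1):
    """Map 1-based inclusive child coordinates onto the parent genome.

    scale is the number of genome bases per child unit: 1 for a plain
    offset, 3 as an assumption for amino-acid coordinates.
    """
    # 0-based distance (in genome bases) from the first base of the analysed feature
    offset_start = (child_start - 1) * scale
    offset_end = child_end * scale - 1

    if parent_strand == "-":
        # On the minus strand the feature's first base is the parent's end coordinate
        new_start = parent_end - offset_end
        new_end = parent_end - offset_start
    else:
        new_start = parent_start + offset_start
        new_end = parent_start + offset_end
    return new_start, new_end, parent_strand


# Gene calls from the example above: ID -> (start, end, strand)
genes = {
    "Merlin_1": (2, 691, "+"),
    "Merlin_3": (1067, 2011, "-"),
}

print(rebase(2, 50, *genes["Merlin_1"]))   # (3, 51, '+')
print(rebase(1, 11, *genes["Merlin_3"]))   # (2001, 2011, '-')
```

A full tool would also need to rewrite the seqid, strand, and `Parent`/`Target` attributes on each GFF3 line; that part is left out here.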

gff3 tool software gene • 1.3k views
Answered 4.2 years ago by Daniel Standage (Davis, California, USA):

UPDATE: It turns out I misunderstood the original question. The response below is for transforming all coordinates for a sequence uniformly.

The `gt gff3` command in GenomeTools has an `-offset` option that lets you apply an offset like the one you describe. It applies the same offset to all features; if you need a different offset for each sequence, use the `-offsetfile` option instead.
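
To make "the same offset for all the data" concrete, here is a rough plain-Python sketch of what a uniform shift amounts to. It is not GenomeTools code and does not reproduce its behaviour in detail; the function name `shift_gff3_line` is invented for illustration. Note that it only adds a constant to the start and end columns and never touches the strand, which is why a uniform offset alone does not cover the minus-strand case.

```python
def shift_gff3_line(line, offset):
    """Add a constant offset to the start/end columns of one GFF3 line."""
    line = line.rstrip("\n")
    # Directives, comments, and blank lines pass through unchanged
    # (a real tool would presumably also adjust ##sequence-region lines)
    if line.startswith("#") or not line.strip():
        return line
    cols = line.split("\t")
    cols[3] = str(int(cols[3]) + offset)  # column 4: start
    cols[4] = str(int(cols[4]) + offset)  # column 5: end
    return "\t".join(cols)


hit = "Merlin\tGene3D\tprotein_match\t2\t50\t2.9E-21\t+\t.\tID=match_2_50"
print(shift_gff3_line(hit, 1))  # start/end become 3 and 51
```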

 


Thanks! I'd looked at GenomeTools a while back, but didn't know about the `-offsetfile` option.

Do you know if it handles strandedness? E.g. if the analysed feature is on the minus strand at 1000-1200 and the `match_part` is 1-100, the final location should be minus strand, 1101-1200.

Comment written 4.2 years ago by rasche.eric

Now I'm not so sure I'm thinking about the same thing you are. Perhaps a couple of examples would help clarify things.

Comment written 4.2 years ago by Daniel Standage

Updated my post with a more descriptive example.

Comment written 4.2 years ago by rasche.eric