How can I merge GFF files together to produce a file with gene functions from both?
2.2 years ago
James ▴ 20

Hello all, I currently have a GFF file with functional annotations for predicted genes in a metagenome. I used another tool to predict functions for the same set of genes, which produced a separate GFF file, and I would like to combine the two files into one that stores both functional annotations. For example, say I have a gene annotated like this in file A:

VMAG_100.1  PhATE   CDS 1   399 .   -   .   ID=VMAG_100.1_consensus_49_geneCall_cds; annot1=(hmm search - jackhmmer) gi|966201526|ref|YP_009191714.1| hypothetical protein T12_45 [Streptococcus phage T12]

And I have the same gene with a different annotation in file B:

VMAG_100.1  PhATE   CDS 1   399 .   -   .   ID=VMAG_100.1_consensus_49_geneCall_cds; annot1=(hmm search - jackhmmer) gi|389060239|ref|YP_006383371.1| hypothetical protein TSMG0091 [Halocynthia phage JM-2012]

How might I produce a consensus file with an output like this:

VMAG_100.1  PhATE   CDS 1   399 .   -   .   ID=VMAG_100.1_consensus_49_geneCall_cds; annot1=(hmm search - jackhmmer) gi|966201526|ref|YP_009191714.1| hypothetical protein T12_45 [Streptococcus phage T12]; annot2=(hmm search - jackhmmer) gi|389060239|ref|YP_006383371.1| hypothetical protein TSMG0091 [Halocynthia phage JM-2012];

Adding the functions manually won't work well considering I have thousands of genes in each of these files. Thank you in advance for the help!

merge protein gff function • 2.2k views

Check the AGAT toolkit (LINK). There should be something in there to do this.


This worked great for me, thank you

2.2 years ago
Juke34 8.6k

The default agat_sp_merge_annotations.pl script was not merging the attributes. When two loci overlapped, it added the differing isoforms, removed identical isoforms, and collapsed the two gene features into one by keeping one feature and adjusting its start and stop accordingly (so the attributes from the second gene were lost). I have made an update in the merge branch so that attributes from the genes, as well as from identical isoforms, are now merged.

Otherwise, you can use one annotation as the reference and add information from the second using a TSV with agat_sq_add_attributes_from_tsv.pl. To prepare the TSV, first run agat_convert_sp_gff2tsv.pl on the second annotation, then filter the columns so that the ID column comes first, followed by all the attributes you want to update.
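For readers who want to see the logic of this second approach spelled out, here is a minimal sketch in plain Python. It is not AGAT and does not reproduce its behaviour; it only illustrates the idea of building an ID-keyed table from the second file and appending its annotation to the matching feature in the reference file. The function names, the `annot2` key, and the assumption that column 9 uses `key=value` pairs separated by `;` (with no `;` inside the values) are mine, based on the example lines in the question.

```python
def parse_attributes(col9):
    """Split a GFF column-9 string into a dict of key -> value.

    Assumes 'key=value' pairs separated by ';' and no ';' inside values.
    """
    attrs = {}
    for part in col9.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, value = part.split("=", 1)
        attrs[key.strip()] = value.strip()
    return attrs


def load_annotations(gff_lines, key="annot1"):
    """Build a table mapping feature ID -> value of `key` (second file)."""
    table = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments, directives and malformed lines
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and key in attrs:
            table[attrs["ID"]] = attrs[key]
    return table


def add_second_annotation(ref_lines, table, new_key="annot2"):
    """Append `new_key=<value>` to each reference feature whose ID is in table."""
    out = []
    for line in ref_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) >= 9:
            feature_id = parse_attributes(cols[8]).get("ID")
            extra = table.get(feature_id)
            if extra is not None:
                # Drop any trailing ';' before appending the new attribute.
                cols[8] = cols[8].rstrip().rstrip(";") + "; " + new_key + "=" + extra + ";"
                line = "\t".join(cols)
        out.append(line)
    return out
```

To use it on real files, read both GFF files into lists of lines (for example with `open(path).readlines()`), call `load_annotations` on file B, pass the result together with the lines of file A to `add_second_annotation`, and write the returned lines out. Features whose ID is missing from file B are left unchanged.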

I would suggest first using a sed command to change the feature type in the third column from CDS to centromere before doing the work, and then converting it back with sed at the end. Why? Because apparently you only have CDS features in your annotation, and AGAT will create mRNA and gene features with random IDs. So you might lose the link between the two annotations (and the original CDS features are not merged if you use the first approach). centromere is supposed to be "standalone", so no extra features will be created.
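If you prefer not to rely on a sed pattern, the same type swap can be sketched in Python. This is only an illustration under my own assumptions (tab-separated GFF, a hypothetical helper named `swap_feature_type`); it touches column 3 only, so the text "CDS" appearing anywhere else on the line, such as in an ID or attribute, is left alone.

```python
def swap_feature_type(gff_lines, old, new):
    """Return GFF lines with the feature type (column 3) changed from old to new.

    Comment lines and lines with fewer than 9 tab-separated columns are
    passed through unchanged.
    """
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) >= 9 and cols[2] == old:
            cols[2] = new
            line = "\t".join(cols)
        out.append(line)
    return out
```

Call it with `old="CDS", new="centromere"` before running the merge, and with the arguments reversed on the merged output to restore the original feature type.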


This was good advice; both of these methods work for me, thank you. I have an additional request here, "Consensus functional annotation from multiple, non-redundant annotations in GFF and Genbank format", if you are able to help again. Thanks in advance.


This worked for me as well, thanks! In case others face this issue in the future, you may find more details on this thread.
