Question: How to get similar naming protein seqs and gff from downloaded genome files
3 months ago, T_18 wrote:

Dear all,

I am facing an issue that a student of mine and I have tried to solve, so far without success (I doubt we are the only ones with this problem). So it seemed a good idea to get some advice from the experts: you!

For a genome synteny analysis we need, for a bunch of species, a protein sequence file (.fasta) and the accompanying bed file (.bed) that describes the location of these peptides in the genome. The bed file should have the following columns: (1) chromosome/scaffold, (2) gene name, (3) start, (4) stop. All this information is in the gff file and I have no problems extracting it. The issue, however, is that for the majority of the >50 insect species for which I need this data, the names of the protein sequences in the fasta file do not match the names given in the gff file. They have to match for the analysis to run without issues. Perhaps the protein sequence names differ from those in the gff because of downstream processing by the various researchers?
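For reference, the gff-to-bed step described above can be sketched in a few lines of Python. This is a minimal sketch, assuming a standard tab-separated, 9-column GFF3 file with the gene name in the `ID` attribute; the function name `gff_to_bed_rows` and the example `mRNA` line are hypothetical. Coordinates are copied unchanged from the gff (1-based), since the four-column layout above is not the standard 0-based BED format.

```python
def gff_to_bed_rows(gff_lines, feature_type="gene"):
    """Yield (chrom, gene_id, start, stop) for each feature of the given type.

    Assumes standard GFF3 column order:
    seqid, source, type, start, end, score, strand, phase, attributes.
    """
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Attributes look like "ID=HEL_008367;Name=HEL_008367;..."
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue  # no usable name on this line
        # Start/stop are kept exactly as in the gff (1-based, inclusive).
        yield cols[0], gene_id, int(cols[3]), int(cols[4])


# Toy input; the mRNA line is a made-up illustration.
example_gff = [
    "##gff-version 3\n",
    "Hel_chr2_13\tmaker\tgene\t2993800\t2996321\t.\t-\t.\tID=HEL_008367;Name=HEL_008367;\n",
    "Hel_chr2_13\tmaker\tmRNA\t2993800\t2996321\t.\t-\t.\tID=HEL_008367-RA;Parent=HEL_008367;\n",
]

bed_rows = list(gff_to_bed_rows(example_gff))
for row in bed_rows:
    print("\t".join(str(field) for field in row))
```

Files whose columns are in a different order, or that store the gene name under another attribute key, would need the column indices or the `ID` lookup adjusted.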

What do you think is the most efficient way of getting all the protein-coding genes and pseudogenes present in the genome into a protein fasta file with exactly the same names as in the gff (or bed) file that I need?

An example of the mismatch, as downloaded:

A sequence name from the .fasta file:

>HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein 

A “gene line” from the gff file shows that the protein sequence names are different:

Hel_chr2_13 2993800 2996321 HEL_008367  .   -   maker   gene    . ID=HEL_008367;Name=HEL_008367;Alias=maker-Hel_chr2_13-snap-gene-30.30;
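Before rewriting anything, it can help to measure how many fasta names already correspond to a gff gene ID after a simple normalisation. The following is a minimal sketch, assuming the only difference is a MAKER-style transcript suffix such as `-RA` plus a free-text description after the first space; the helper names, the second gene ID, and the `XP_000001.1` header are hypothetical, and other sources will need other rules.

```python
import re


def fasta_ids(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header."""
    return [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]


def strip_transcript_suffix(name):
    """Assumed normalisation: drop a trailing '-RA', '-RB', ... suffix."""
    return re.sub(r"-R[A-Z]+$", "", name)


def compare_names(fasta_names, gff_gene_ids):
    """Split FASTA names into those that map to a gff gene ID and those that do not."""
    gff_ids = set(gff_gene_ids)
    matched, unmatched = [], []
    for name in fasta_names:
        target = matched if strip_transcript_suffix(name) in gff_ids else unmatched
        target.append(name)
    return matched, unmatched


# Toy input: the first header is the one from the post, the second is made up.
example_fasta = [
    ">HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein\n",
    "MSTK\n",
    ">XP_000001.1 hypothetical protein\n",
    "MAAA\n",
]
example_gene_ids = ["HEL_008367", "HEL_007193"]  # second ID is assumed to exist

matched, unmatched = compare_names(fasta_ids(example_fasta), example_gene_ids)
print("matched:", matched)
print("unmatched:", unmatched)
```

A species where almost everything lands in `matched` probably only needs its headers renamed; one where most names are `unmatched` is a candidate for re-extracting the proteins from the gff.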

A correct example:

.fasta file:


.bed file:

Chr1 AT1G50680 18777601 18778614
Chr1 AT1G50690 18779606 18780693

So in short: how do I get matching names in both the protein sequence file and the gff/bed file? Is the only solution to extract the protein sequences from the genome file using the gff alone? If so, how? Or are there other ways?

Thanks so much in advance!

unix gff bedtools genome

Is this data in GenBank or are these private files that you obtained from somewhere else?

written 3 months ago by GenoMax

Unfortunately the data does not come from a single source; it is a mix from various public databases and supplementary data files of papers. That's also why some of the data files are perfectly fine, with protein names that match the GFF, but unfortunately for a very large part the names are not identical.

written 3 months ago by T_18

@genomax And just to add: because the files come from various sources, each has its own specific issues, so a simple renaming fix for one does not help for another. Might it be best to simply extract the proteins from scratch using the gff file?

written 3 months ago by T_18

If the source of these varies then that may be the solution. The solutions mentioned here are worth a look: "How to get proteins from GFF file resulted from MAKER annotation"
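To illustrate what "extracting proteins from scratch" involves, here is a minimal pure-Python sketch. It assumes a standard GFF3 file in which each `CDS` has a single `Parent` transcript and each `mRNA` has a `Parent` gene, that every CDS starts in phase 0, and that the standard genetic code applies; the genome is passed as a plain dict and all names and toy sequences are hypothetical. Dedicated tools such as gffread or AGAT handle many more edge cases and are likely the better choice for real data.

```python
from collections import defaultdict

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(cds):
    """Translate from the first base; unknown codons become 'X'."""
    cds = cds.upper()
    aa = [CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)]
    return "".join(aa).rstrip("*")  # drop the terminal stop codon


def parse_attrs(field):
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def proteins_from_gff(gff_lines, genome):
    """Return {gene_id: protein}, keeping the longest isoform per gene.

    genome is a dict {sequence name: sequence string}.
    """
    mrna_to_gene = {}
    cds = defaultdict(list)  # transcript ID -> [(chrom, start, end, strand)]
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            mrna_to_gene[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "CDS" and "Parent" in attrs:
            cds[attrs["Parent"]].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    proteins = {}
    for transcript, segments in cds.items():
        segments.sort(key=lambda seg: seg[1])  # order CDS pieces by start
        chrom, strand = segments[0][0], segments[0][3]
        # GFF coordinates are 1-based and inclusive.
        seq = "".join(genome[chrom][start - 1:end] for _, start, end, _ in segments)
        if strand == "-":
            seq = reverse_complement(seq)
        gene = mrna_to_gene.get(transcript, transcript)
        protein = translate(seq)
        if len(protein) > len(proteins.get(gene, "")):
            proteins[gene] = protein
    return proteins


# Toy genome and annotation (made up): one two-exon plus-strand gene, one minus-strand gene.
toy_genome = {"chr1": "CCATGGCCGTAGAAATAACC", "chr2": "GGCTACCACATGG"}
toy_gff = [
    "chr1\tx\tmRNA\t3\t18\t.\t+\t.\tID=geneA-RA;Parent=geneA\n",
    "chr1\tx\tCDS\t3\t8\t.\t+\t0\tID=c1;Parent=geneA-RA\n",
    "chr1\tx\tCDS\t13\t18\t.\t+\t0\tID=c2;Parent=geneA-RA\n",
    "chr2\tx\tmRNA\t3\t11\t.\t-\t.\tID=geneB-RA;Parent=geneB\n",
    "chr2\tx\tCDS\t3\t11\t.\t-\t0\tID=c3;Parent=geneB-RA\n",
]

toy_proteins = proteins_from_gff(toy_gff, toy_genome)
for gene_id, protein in toy_proteins.items():
    print(f">{gene_id}\n{protein}")
```

Because each protein is written under the gene ID taken from the gff itself, the fasta headers and the gene names in the bed file agree by construction. Keeping only the longest isoform is one possible choice, made here so that each gene has exactly one sequence.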

written 3 months ago by GenoMax