How to get matching names for protein sequences (FASTA) and GFF from downloaded genome files
8 months ago
T_18 ▴ 40

Dear all,

A student of mine and I have been trying to solve a problem (I doubt we are the only ones who have run into it), but we have not found a solution yet. So we would like to ask the experts for advice: you!

For a genome synteny analysis we need, for each of a set of species, a protein sequence file (.fasta) and an accompanying BED file (.bed) that describes the location of these proteins in the genome. The BED file should have the following columns: (1) chromosome/scaffold, (2) gene name, (3) start, (4) stop. All of this information is in the GFF file and I have no problem extracting it. The issue is that, for the majority of the >50 insect species for which I need this data, the protein sequence names in the FASTA file do not match the names given in the GFF file. They need to match for the analysis to run without issues. Perhaps the names diverged because the various research groups processed the protein files further after annotation?
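For readers who want to see the BED extraction step spelled out, here is a minimal sketch. It assumes the standard nine-column, tab-separated GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes) and that the gene name is stored in the `ID` attribute; files with a different column order would need the indices adjusted. The function name `gff3_genes_to_bed` is hypothetical, and the coordinates are copied unchanged from the GFF.

```python
def gff3_genes_to_bed(gff_lines, feature_types=("gene", "pseudogene")):
    """Yield (seqid, gene_id, start, end) for gene-level rows of a GFF3 stream.

    Assumes standard GFF3 column order and that the gene name is the ID attribute.
    """
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue  # keep only gene-level features
        # Parse "key=value;key=value" attributes into a dict.
        attrs = dict(
            field.split("=", 1)
            for field in cols[8].strip().strip(";").split(";")
            if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        # Coordinates are passed through as given in the GFF (1-based, inclusive).
        yield cols[0], gene_id, int(cols[3]), int(cols[4])


def write_bed(rows, handle):
    """Write rows as tab-separated chromosome, gene name, start, stop."""
    for seqid, gene_id, start, end in rows:
        handle.write(f"{seqid}\t{gene_id}\t{start}\t{end}\n")
```

Typical use would be `write_bed(gff3_genes_to_bed(open("species.gff")), open("species.bed", "w"))`, where both file names are placeholders.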

What do you think is the most efficient way to get all protein-coding genes and pseudogenes present in the genome into a protein FASTA file, with exactly the same names as in the GFF (or BED) file?

An example of a mismatch, as downloaded:

A sequence name from the .fasta file:

>HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein 

A “gene line” from the gff file shows that the protein sequence names are different:

Hel_chr2_13 2993800 2996321 HEL_008367  .   -   maker   gene    . ID=HEL_008367;Name=HEL_008367;Alias=maker-Hel_chr2_13-snap-gene-30.30;

A correct example:

.fasta file:


.bed file:

Chr1 AT1G50680 18777601 18778614
Chr1 AT1G50690 18779606 18780693

So in short: how do I get identical names in both the protein sequence file and the GFF/BED file? Is the only solution to extract the protein sequences from the genome file using the GFF? If so, how? Or are there other ways?

Thanks so much in advance!

Genome gff UNIX bedtools • 255 views

Is this data in GenBank or are these private files that you obtained from somewhere else?


Unfortunately the data do not come from a single source; they are a mix from various public databases and supplementary data files of papers. That is also why some of the files are perfectly fine, with protein names that match the GFF, while for a very large part of them the names are not identical.
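With more than 50 species from mixed sources, it may help to first find out which species actually have a naming problem. The sketch below compares the first token of each FASTA header with the name column of the BED file and reports what does not match. The function names and the report keys are made up for illustration, and the BED file is assumed to have the gene name in the second column.

```python
def fasta_ids(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def bed_ids(bed_lines, name_col=1):
    """Return the names in column `name_col` (0-based) of a whitespace-separated BED."""
    ids = set()
    for line in bed_lines:
        cols = line.split()
        if len(cols) > name_col:
            ids.add(cols[name_col])
    return ids


def naming_report(fasta_lines, bed_lines):
    """Summarise how well the FASTA names and BED names agree for one species."""
    f, b = fasta_ids(fasta_lines), bed_ids(bed_lines)
    shared = f & b
    return {
        "shared": sorted(shared),
        "fasta_only": sorted(f - b),
        "bed_only": sorted(b - f),
        # Fraction of FASTA entries that have a BED entry with the same name.
        "fraction_fasta_matched": len(shared) / len(f) if f else 0.0,
    }
```

Species with a matched fraction of 1.0 can be used as they are; the `fasta_only` names of the others often show the pattern of the mismatch (for example a transcript suffix), which helps decide between renaming and re-extracting.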


@genomax And just to add: because the files come from various sources, each has its own specific issues, so a simple renaming fix for one does not help for another. Might it be best to simply extract the proteins from scratch using the GFF file?


If the source of these varies then that may be the solution. Solutions mentioned here are worth a look: How to get proteins from GFF file resulted from MAKER annotation
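To illustrate what "extracting proteins from scratch" involves, here is a rough pure-Python sketch, not a replacement for a dedicated tool such as gffread. It makes several assumptions: a standard nine-column GFF3 with `gene` -> `mRNA`/`transcript` -> `CDS` linked through `ID`/`Parent` attributes, the standard genetic code, a first CDS segment in phase 0, and one protein per gene (the longest isoform). Features without CDS rows, such as most pseudogenes, yield no protein. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_attrs(field):
    """Turn a GFF3 attribute column into a dict."""
    return dict(p.split("=", 1) for p in field.strip().strip(";").split(";") if "=" in p)


def read_fasta(lines):
    """Read FASTA lines into {first header token: sequence}."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None and line:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def translate(dna):
    """Translate in frame 0; unknown or ambiguous codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def proteins_by_gene(genome, gff_lines):
    """Return {gene ID: longest translated isoform}, named by the GFF gene ID."""
    tx_to_gene = {}
    cds = defaultdict(list)  # transcript ID -> [(start, end, strand, seqid)]
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        if cols[2] in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            tx_to_gene[attrs["ID"]] = attrs["Parent"]
        elif cols[2] == "CDS" and "Parent" in attrs:
            for tx in attrs["Parent"].split(","):
                cds[tx].append((int(cols[3]), int(cols[4]), cols[6], cols[0]))
    best = {}
    for tx, parts in cds.items():
        parts.sort()  # genomic order by start coordinate
        strand, seqid = parts[0][2], parts[0][3]
        # GFF is 1-based inclusive, so slice [start - 1:end].
        dna = "".join(genome[seqid][s - 1:e] for s, e, _, _ in parts)
        if strand == "-":
            dna = dna.translate(COMPLEMENT)[::-1]  # reverse complement
        protein = translate(dna).rstrip("*")  # drop the terminal stop
        gene = tx_to_gene.get(tx, tx)  # fall back to the transcript ID
        if len(protein) > len(best.get(gene, "")):
            best[gene] = protein
    return best
```

Because the output is keyed by the `ID` of the gene feature, writing it out as `>gene_id` records gives a FASTA whose names are identical to the gene names in a BED file derived from the same GFF.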

