I am facing an issue that a student of mine and I have been trying to solve, so far without success (and I don’t think we are the only ones with this problem). It seems like a good idea to ask the experts for advice: you!
For a genome synteny analysis we need, for each of a set of species, a protein sequence file (.fasta) and an accompanying bed file (.bed) that describes the location of these peptides in the genome. The bed file should have the following columns: (1) chromosome/scaffold, (2) gene name, (3) start, (4) stop. All of this information is in the gff file and I have no problem extracting it. The issue, however, is that for the majority of the >50 insect species for which I need this data, the names of the protein sequences in the fasta file do not match the names given in the gff file. They have to match for the analysis to run without issues. Perhaps the protein sequence names differ from those in the gff because of downstream processing by the various research groups?
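For reference, the gff-to-bed extraction described above can be sketched roughly as follows. This is only a minimal sketch, assuming a standard tab-delimited, 9-column GFF3 file (seqid, source, type, start, end, score, strand, phase, attributes) in which each gene line carries an `ID=` attribute. The function name `gff_genes_to_bed` is made up for illustration, and the coordinates are copied unchanged from the gff (1-based), which may or may not be what a given synteny tool expects.

```python
def gff_genes_to_bed(gff_lines, feature_types=("gene", "pseudogene")):
    """Yield (chromosome, gene_name, start, stop) for gene-level GFF3 features.

    Assumes standard 9-column, tab-delimited GFF3; other layouts are skipped.
    """
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        # Attributes look like "ID=...;Name=...;Alias=...;"
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        # Coordinates are passed through exactly as written in the gff.
        yield cols[0], gene_id, int(cols[3]), int(cols[4])


def write_bed(gff_path, bed_path):
    """Write the 4-column file (chromosome, gene name, start, stop)."""
    with open(gff_path) as gff, open(bed_path, "w") as bed:
        for chrom, name, start, stop in gff_genes_to_bed(gff):
            bed.write(f"{chrom}\t{name}\t{start}\t{stop}\n")
```

Calling `write_bed("annotation.gff", "genes.bed")` (both file names are placeholders) would write one line per gene or pseudogene.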
What do you think is the most efficient way to get all the protein-coding genes and pseudogenes present in the genome into a protein fasta file, with exactly the same names as in the gff (or bed) file?
An example of the mismatch, as the files are currently downloaded:
A sequence name from the .fasta file:
>HEL_007193-RA heliconius_erato_lativitta_v3_core_32_85_1 protein
MGNVKTLFCTLRPEVCTNKVAIVLGGLPGVTSETRAERPYFDDVSPRNVSAVVGQAAVLRCRAKHTGNRTVSWMRKRDLHILTSHIFTYTGDARFSVLHPEPSDDWDLKIDYVQPRDAGVYECQINTEPKINMAVMLNVEAAAASIWGSQDVYVKKGSTISLTCSVNVHSSPPSSASVLWYHGNAVVDFDSPRGGISLETEKTEGGTTSKLLVTKAALTDSGNYTCVPNNAHPASNILNKSTYVGTPKDK
A “gene line” from the gff file shows that the protein sequence names are different:
Hel_chr2_13 2993800 2996321 HEL_008367 . - maker gene . ID=HEL_008367;Name=HEL_008367;Alias=maker-Hel_chr2_13-snap-gene-30.30;
A correct example:
>AT1G50680
MRLDDEPENALVVSSSPKTVVASGNVKYKGVVQQQNGHWGAQIYADHKRIWLGTFKSADEAATAYDSASIKLRSFDANSHRNFPWSTITLNEPDFQNCYTTETVLNMIRDGSYQHKFRDFLRIRSQIVASINIGGPKQARGEVNQESDKCFSCTQLFQKELTPSDVGKLNRLVIPKKYAVKYMPFISADQSEKEEGEIVGSVEDVEVVFYDRAMRQWKFRYCYWKSSQSFVFTRGWNSFVKEKNLKEKDVIAFYTCDVPNNVKTLEGQRKNFLMIDVHCFSDNGSVVAEEVSMTVHDSSVQVKKTENLVSSMLEDKETKSEENKGGFMLFGVRIECP*
>AT1G50690
MDPQVVVDKKSEEPDLKRQKLEEEEEEDCEEMSSYSESTCSFDSEDERLVEEEYQRSGYYDFDTTKQRRLVFCYPVIFEDSDVAHKPETDGDLVHRLSKIALQKYNDDKLENLELVRAVKANRKYGAGFIFYITFEAKDANSHTDPITFQAAVRYLRGIETVYRVHPKPLLDSTK*
Chr1 AT1G50680 18777601 18778614
Chr1 AT1G50690 18779606 18780693
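The difference between the wrong and the correct example can be checked programmatically by comparing the first word of every fasta header with the second column of the bed file. The following is a minimal sketch with made-up helper names (`fasta_ids`, `bed_names`, `compare_names`), assuming whitespace- or tab-separated bed columns in the order given above.

```python
def fasta_ids(fasta_lines):
    """Return the set of sequence names (first word after '>') in a fasta file."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def bed_names(bed_lines):
    """Return the set of gene names (second column) in the 4-column bed file."""
    names = set()
    for line in bed_lines:
        cols = line.split()
        if len(cols) >= 4:
            names.add(cols[1])
    return names


def compare_names(fasta_lines, bed_lines):
    """Return (names only in the fasta, names only in the bed), both sorted."""
    in_fasta = fasta_ids(fasta_lines)
    in_bed = bed_names(bed_lines)
    return sorted(in_fasta - in_bed), sorted(in_bed - in_fasta)
```

For a matching pair of files both returned lists are empty. For the Heliconius files, names such as `HEL_007193-RA` would be expected to show up in the first list.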
So in short: how do I get matching names in the protein sequence file and the gff/bed file? Is the only solution to extract the protein sequences from the genome file using the gff? If so, how? Or are there other ways?
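One possible route, sketched here purely as an assumption: if the fasta names (e.g. `HEL_007193-RA`) are transcript IDs that occur as the `ID=` of mRNA lines in the gff, and those lines point to their gene through `Parent=`, then the fasta headers could be renamed to the gene names. Whether the files for these species actually follow this convention is not verified. The function names are hypothetical and the gff is assumed to be standard 9-column GFF3.

```python
def transcript_to_gene(gff_lines, transcript_types=("mRNA", "transcript")):
    """Map transcript ID -> parent gene ID from GFF3 transcript-level lines."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in transcript_types:
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs and "Parent" in attrs:
            mapping[attrs["ID"]] = attrs["Parent"]
    return mapping


def rename_fasta(fasta_lines, mapping):
    """Yield fasta lines with each header reduced to the mapped gene name.

    Headers whose first word is not in the mapping keep that first word,
    so unmapped sequences remain visible for inspection.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            yield ">" + mapping.get(name, name)
        else:
            yield line
```

Note that a gene with several isoforms would end up with several sequences under the same name, so one protein per gene (for example the longest) would still have to be selected afterwards.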
Thanks so much in advance!