Question

Extract protein sequences or names based on GFF and protein file

4.0 years ago

T_18 ▴ 50

Dear all,

In short, the gene IDs in my GFF file do not correspond to the protein headers, but I need them to match for an analysis. What can I do to make sure the protein headers are identical to the gene IDs in the GFF file?

I am doing a genome comparison of circa 40 species, for which I want to do a microsynteny analysis. For each species I need two inputs: a BED file containing only the gene information, and the protein sequence file.

75% of the GFF files are not correct, and the issue is not the same for all files. Would it be possible to rename the gene IDs in the GFF file based on, e.g., the CDS entries given? And in case even the CDS IDs are not correct, I suppose the only way is to extract the genes based on the GFF gene coordinates and the full genome assembly (this is what I am trying to avoid).

The problem I am facing is that the gene IDs do not correspond to the protein IDs in the sequence file:

GFF file example:

    LADJ01009471.1  Genbank gene    1   4718    .   +   .   ID=gene747;Name=RR48_00748;description=Ethanolaminephosphotransferase 1;end_range=4718%2C.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=RR48_00748;partial=true;stable_id=RR48_00748;start_range=.%2C1
    LADJ01009471.1  Genbank mRNA    1   4718    .   +   .   ID=rna747;Parent=gene747;end_range=4718%2C.;gbkey=mRNA;partial=true;product=Ethanolaminephosphotransferase 1;stable_id=KPJ20932.1;start_range=.%2C1;translation_stable_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 1   85  .   +   0   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 1689    1842    .   +   2   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 2856    3110    .   +   1   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 4502    4718    .   +   1   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1

Protein header corresponding to this part:

    >KPJ20932.1 papilio_machaon_papma1_core_32_85_1 protein Ethanolaminephosphotransferase 1

GFF unix • 1.6k views

ADD COMMENT • link 4.0 years ago by T_18 ▴ 50

You can grep for protein_id and use Entrez Direct to download protein sequences. If you have many thousands of proteins to work with and this approach is slow, you can download the protein sequences for the entire assembly separately and then use something like seqkit or bedtools to extract the FASTA records based on protein_id.

ADD REPLY • link 4.0 years ago by vkkodali_ncbi ★ 3.7k
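For GFF files shaped like the example in the question, the protein_id on each CDS line can be linked back to its gene by following the Parent attributes (CDS to mRNA to gene). Below is a minimal Python sketch of that idea. The function names `parse_attributes` and `protein_to_gene` are made up for illustration, and the code assumes tab-separated GFF3 with a single Parent per feature; files that deviate from this (which seems likely for some of the 40 species) would need adjusting.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn 'key=value;key=value' from GFF3 column 9 into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)  # decode escapes such as %2C
    return attrs


def protein_to_gene(gff_lines):
    """Map each CDS protein_id to its top-level feature ID via Parent links."""
    parent_of = {}   # feature ID -> Parent ID (e.g. mRNA -> gene)
    cds_parent = {}  # protein_id -> Parent ID of the CDS (usually an mRNA)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # assumption: skip anything that is not a 9-column row
        attrs = parse_attributes(cols[8])
        if cols[2] == "CDS" and "protein_id" in attrs and "Parent" in attrs:
            cds_parent[attrs["protein_id"]] = attrs["Parent"]
        elif "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]

    mapping = {}
    for protein_id, feature in cds_parent.items():
        seen = set()
        # Walk upward until a feature without a Parent (the gene) is reached.
        while feature in parent_of and feature not in seen:
            seen.add(feature)
            feature = parent_of[feature]
        mapping[protein_id] = feature
    return mapping
```

With the three feature types from the example above, `protein_to_gene(open("annotation.gff"))` (file name hypothetical) would map KPJ20932.1 to gene747. If the locus_tag is preferred over the ID, the same walk could collect that attribute instead.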

Thanks for your quick reply! I also have several genome files which are not available via Entrez. I guess the best and most efficient approach would then be to use seqkit/bedtools to extract the genes based on the original GFF file and the genome assembly file?

Is there not a more efficient way that uses only the protein and GFF files?

ADD REPLY • link 4.0 years ago by T_18 ▴ 50
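One possible route that uses only the protein and GFF files is sketched below in Python: rename the FASTA headers with a protein-to-gene lookup table, and write the BED rows straight from the gene lines of the GFF. The function names are invented for illustration, and the sketch assumes that a dict from protein ID to gene ID is already available (for instance derived from the Parent links in the GFF) and that the GFF is tab-separated with 9 columns.

```python
def rename_fasta_headers(fasta_lines, protein_to_gene):
    """Yield FASTA lines, replacing each header with the matching gene ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                yield line  # empty header: leave untouched
                continue
            protein_id = words[0]
            # Fall back to the original protein ID when no gene is known.
            yield ">" + protein_to_gene.get(protein_id, protein_id) + "\n"
        else:
            yield line


def gene_bed_rows(gff_lines):
    """Yield BED6 rows for every 'gene' feature of a tab-separated GFF3."""
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        start = int(cols[3]) - 1  # GFF is 1-based, BED start is 0-based
        yield "\t".join(
            [cols[0], str(start), cols[4], attrs.get("ID", "."), "0", cols[6]]
        )
```

For the example gene this would give the BED row `LADJ01009471.1  0  4718  gene747  0  +` and the header `>gene747`. Note that genes with several isoforms would end up with duplicate headers, so one protein per gene would have to be chosen beforehand.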