Question: Extract protein sequences or names based on GFF and protein file
6 months ago, T_18 wrote:

Dear all,

In short, the gene IDs in my GFF file do not correspond to the protein headers, and I need them to match for an analysis. What can I do to make sure the protein headers are identical to the gene IDs in the GFF file?

I am doing a genome comparison of circa 40 species, for which I want to do a microsynteny analysis. For each species I need two inputs: a BED file containing only the gene information, and the protein sequence file.

75% of the GFF files are not correct, and the issue is not the same for all the files. Would it be possible to rename the gene IDs in the GFF file based on, e.g., the CDS entries given? And in case even the CDS IDs are not correct, I suppose the only way is to extract the genes based on the GFF gene coordinates and the full genome assembly (this is what I am trying to avoid).

The problem I am facing is that the gene IDs do not correspond with the protein IDs from the seq file:

GFF file example:

    LADJ01009471.1  Genbank gene    1   4718    .   +   .   ID=gene747;Name=RR48_00748;description=Ethanolaminephosphotransferase 1;end_range=4718%2C.;gbkey=Gene;gene_biotype=protein_coding;locus_tag=RR48_00748;partial=true;stable_id=RR48_00748;start_range=.%2C1
    LADJ01009471.1  Genbank mRNA    1   4718    .   +   .   ID=rna747;Parent=gene747;end_range=4718%2C.;gbkey=mRNA;partial=true;product=Ethanolaminephosphotransferase 1;stable_id=KPJ20932.1;start_range=.%2C1;translation_stable_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 1   85  .   +   0   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 1689    1842    .   +   2   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 2856    3110    .   +   1   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1
    LADJ01009471.1  Genbank CDS 4502    4718    .   +   1   Dbxref=InterPro:IPR000462,UniProtKB/Swiss-Prot:Q80TA1,NCBI_GP:KPJ20932.1;ID=cds747;Name=KPJ20932.1;Parent=rna747;gbkey=CDS;partial=true;product=Ethanolaminephosphotransferase 1;protein_id=KPJ20932.1

Protein header corresponding to this part:

    >KPJ20932.1 papilio_machaon_papma1_core_32_85_1 protein Ethanolaminephosphotransferase 1

You can grep for protein_id and use Entrez Direct to download protein sequences. If you have many thousands of proteins to work with and this approach is slow, you can download the protein sequences for the entire assembly separately and then use something like seqkit or bedtools to extract fasta based on protein_id.

Reply written 6 months ago by vkkodali
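
The grep-and-extract idea from the reply above can be sketched in plain Python for the case where the full protein FASTA is already on disk. This is only a minimal sketch under some assumptions: the GFF is tab-separated GFF3, CDS features carry a `protein_id` attribute (as in the example in the question), and the first word of each FASTA header is the protein ID. The function names `protein_ids_from_gff` and `filter_fasta` are made up for illustration; tools such as seqkit can do the extraction step as well.

```python
import re


def protein_ids_from_gff(gff_lines):
    """Collect the protein_id attribute of every CDS feature."""
    ids = set()
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumption: nine tab-separated columns, feature type in column 3.
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        match = re.search(r"(?:^|;)protein_id=([^;]+)", cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def filter_fasta(fasta_lines, wanted):
    """Yield the lines of FASTA records whose first header word is in `wanted`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            keep = bool(words) and words[0] in wanted
        if keep:
            yield line
```

Both functions take any iterable of lines, so open file handles can be passed directly, for example `ids = protein_ids_from_gff(open("species.gff"))` followed by writing the output of `filter_fasta(open("proteins.fa"), ids)` to a new file (the file names here are placeholders).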

Thanks for your quick reply! I also have several genome files which are not available via Entrez. I guess the best and most efficient approach would then be to use seqkit/bedtools to extract the genes based on the original GFF file and the genome assembly file?

Is there no more efficient way that uses only the protein and GFF files?

Reply written 6 months ago by T_18
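
One way to work from the GFF and protein files alone, without going back to the assembly, is to follow the `Parent` links in the GFF from each CDS up to its gene, and then rename the FASTA headers and write the BED file from that mapping. The sketch below assumes a GFF3 layout like the example in the question (gene, then mRNA, then CDS, with `protein_id` on the CDS and a single `Parent` per feature); files that break these assumptions would need their own handling. All function names are hypothetical.

```python
def parse_attributes(field):
    """Turn 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def read_gff(gff_lines):
    """Return gene coordinates, Parent links and protein_id links from GFF3 lines."""
    genes, parent_of, protein_parent = {}, {}, {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: if several parents are listed, only the first is used.
        parent = attrs.get("Parent", "").split(",")[0]
        if cols[2] == "gene" and "ID" in attrs:
            genes[attrs["ID"]] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        if "ID" in attrs and parent:
            parent_of[attrs["ID"]] = parent
        if cols[2] == "CDS" and "protein_id" in attrs and parent:
            protein_parent[attrs["protein_id"]] = parent
    return genes, parent_of, protein_parent


def protein_to_gene(genes, parent_of, protein_parent):
    """Follow Parent links upward from each CDS until a gene ID is reached."""
    mapping = {}
    for protein_id, feature in protein_parent.items():
        seen = set()  # guards against circular Parent links
        while feature not in genes and feature in parent_of and feature not in seen:
            seen.add(feature)
            feature = parent_of[feature]
        if feature in genes:
            mapping[protein_id] = feature
    return mapping


def rename_fasta(fasta_lines, mapping):
    """Yield FASTA lines, replacing each known protein header with its gene ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words and words[0] in mapping:
                line = ">" + mapping[words[0]] + "\n"
        yield line


def genes_to_bed(genes, keep=None):
    """Yield BED lines (0-based start) for all genes, or only those in `keep`."""
    for gene_id, (seqid, start, end, strand) in genes.items():
        if keep is None or gene_id in keep:
            yield f"{seqid}\t{start - 1}\t{end}\t{gene_id}\t0\t{strand}"
```

For the example above, this would map `KPJ20932.1` to `gene747`, so the protein header and the BED name column end up using the same identifier. If another attribute such as `locus_tag` is preferred as the gene name, the same traversal applies; only the key stored in `genes` changes. Proteins that never reach a gene are left with their original header, which makes the problematic files easy to spot.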