Query regarding extraction of GTF file
9 weeks ago
abhisek061 ▴ 30

I have the following GTF file. I want to extract the GO IDs and the associated protein IDs for some specific gene IDs (e.g. NZ_CP072122.1, etc.) so that I can run GO enrichment analysis and KEGG mapping analysis. Can anyone help me write some code?

I have differential gene expression data, from which I fetched the genes that are upregulated and downregulated. I also have the annotation file of this genome and the protein sequences in FASTA format.

Next, I converted the protein IDs with BlastKOALA (a conversion tool in KEGG) into their associated KEGG IDs, to map my differentially expressed genes onto pathways. That's why I need to extract the gene ID, GO ID, and associated protein ID from the GTF file.

I know some languages at a beginner level, e.g. R, bash scripting, and Python.

If anyone can help me a little more, please suggest some good blogs/articles/posts about KEGG mapping and GO analysis. Thanks in advance.

NZ_CP072122.1   RefSeq  gene    25092   26945   .   +   .   ID=gene-J5P21_RS00130;Name=J5P21_RS00130;gbkey=Gene;gene_biotype=protein_coding;locus_tag=J5P21_RS00130;old_locus_tag=J5P21_00130
NZ_CP072122.1   Protein Homology    CDS 25092   26945   .   +   0   ID=cds-WP_001278225.1;Parent=gene-J5P21_RS00130;Dbxref=Genbank:WP_001278225.1;Name=WP_001278225.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997422.1;locus_tag=J5P21_RS00130;product=ferrous iron transporter B;protein_id=WP_001278225.1;transl_table=11
NZ_CP072122.1   RefSeq  gene    26966   27217   .   +   .   ID=gene-J5P21_RS00135;Name=J5P21_RS00135;gbkey=Gene;gene_biotype=protein_coding;locus_tag=J5P21_RS00135;old_locus_tag=J5P21_00135
NZ_CP072122.1   Protein Homology    CDS 26966   27217   .   +   0   ID=cds-WP_000942501.1;Parent=gene-J5P21_RS00135;Dbxref=Genbank:WP_000942501.1;Name=WP_000942501.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997423.1;locus_tag=J5P21_RS00135;product=hypothetical protein;protein_id=WP_000942501.1;transl_table=11
NZ_CP072122.1   RefSeq  gene    27378   28724   .   +   .   ID=gene-J5P21_RS00140;Name=murD;gbkey=Gene;gene=murD;gene_biotype=protein_coding;locus_tag=J5P21_RS00140;old_locus_tag=J5P21_00140
NZ_CP072122.1   Protein Homology    CDS 27378   28724   .   +   0   ID=cds-WP_045544631.1;Parent=gene-J5P21_RS00140;Dbxref=Genbank:WP_045544631.1;Name=WP_045544631.1;Ontology_term=GO:0009252,GO:0008764;gbkey=CDS;gene=murD;go_function=UDP-N-acetylmuramoylalanine-D-glutamate ligase activity|0008764||IEA;go_process=peptidoglycan biosynthetic process|0009252||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997424.1;locus_tag=J5P21_RS00140;product=UDP-N-acetylmuramoyl-L-alanine--D-glutamate ligase;protein_id=WP_045544631.1;transl_table=11
NZ_CP072122.1   RefSeq  gene    28749   29945   .   +   .   ID=gene-J5P21_RS00145;Name=ftsW;gbkey=Gene;gene=ftsW;gene_biotype=protein_coding;locus_tag=J5P21_RS00145;old_locus_tag=J5P21_00145
NZ_CP072122.1   Protein Homology    CDS 28749   29945   .   +   0   ID=cds-WP_000907680.1;Parent=gene-J5P21_RS00145;Dbxref=Genbank:WP_000907680.1;Name=WP_000907680.1;Ontology_term=GO:0009252,GO:0051301,GO:0003674,GO:0016020;gbkey=CDS;gene=ftsW;go_component=membrane|0016020||IEA;go_function=molecular_function|0003674||IEA;go_process=peptidoglycan biosynthetic process|0009252||IEA,cell division|0051301||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997425.1;locus_tag=J5P21_RS00145;product=putative lipid II flippase FtsW;protein_id=WP_000907680.1;transl_table=11
NZ_CP072122.1   RefSeq  gene    29995   30927   .   -   .   ID=gene-J5P21_RS00150;Name=gluQRS;gbkey=Gene;gene=gluQRS;gene_biotype=protein_coding;locus_tag=J5P21_RS00150;old_locus_tag=J5P21_00150
NZ_CP072122.1   Protein Homology    CDS 29995   30927   .   -   0   ID=cds-WP_000216745.1;Parent=gene-J5P21_RS00150;Dbxref=Genbank:WP_000216745.1;Name=WP_000216745.1;Ontology_term=GO:0043039,GO:0004812;gbkey=CDS;gene=gluQRS;go_function=aminoacyl-tRNA ligase activity|0004812||IEA;go_process=tRNA aminoacylation|0043039||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:YP_004997426.1;locus_tag=J5P21_RS00150;product=tRNA glutamyl-Q(34) synthetase GluQRS;protein_id=WP_000216745.1;transl_table=11
NZ_CP072122.1   RefSeq  gene    30930   31466   .   -   .   ID=gene-J5P21_RS00155;Name=dksA;gbkey=Gene;gene=dksA;gene_biotype=protein_coding;locus_tag=J5P21_RS00155;old_locus_tag=J5P21_00155

gtf gff • 665 views

What have you tried so far?

Thanks for your response. Actually, I have differential gene expression data, from which I fetched the genes that are upregulated and downregulated. I also have the annotation file of this genome and the protein sequences in FASTA format.

abhisek061, why did you delete the post?

It had been about three days and no one had responded to my query, and my post was showing in red. I thought no one could see it, so I created a new post and deleted that one.

I've deleted your other post. Please do not open multiple posts for the same topic.

What coding languages would you prefer? There are methods for extracting the gene and protein IDs, but KEGG and GO analysis are entirely different efforts. This is similar to "I have a key, can you provide me with a car and show me how to drive?" Are there KEGG and GO resources for this organism? You can't do GO enrichment analysis with a single gtf file on its own.

Sir, please check my response at the top of this post. I tried to explain what I want to do.

Have you tried reading the GTF file with Python and gtfparse? That should give you a data structure with a column for each of the attributes. But if your GTF file is non-standard or problematic, you could simply use Python to read it in and parse the 9th field for what you need.
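
For the second route (parsing the 9th field yourself), here is a minimal sketch using only the Python standard library. It assumes the real file is tab-delimited and that the attributes are GFF3-style `key=value` pairs separated by `;`, as in the lines you posted. The function names, the `sample` records (abbreviated from your post), and the `genes_of_interest` set are hypothetical, just for illustration. It takes the gene ID from `locus_tag`, the protein ID from `protein_id`, and the GO IDs from `Ontology_term`, all read off the CDS lines.

```python
def parse_attributes(field):
    """Split a key=value;key=value attribute column into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def extract_cds_annotations(lines):
    """Yield (locus_tag, protein_id, [GO ids]) for every CDS record."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS rows carry protein_id and GO terms here
        attrs = parse_attributes(cols[8])
        # Ontology_term is absent for unannotated proteins, so default to ""
        go_ids = [g for g in attrs.get("Ontology_term", "").split(",") if g]
        yield attrs.get("locus_tag", ""), attrs.get("protein_id", ""), go_ids


# Abbreviated records from the post, with real tabs between the 9 columns.
sample = (
    "NZ_CP072122.1\tRefSeq\tgene\t27378\t28724\t.\t+\t.\t"
    "ID=gene-J5P21_RS00140;Name=murD;locus_tag=J5P21_RS00140\n"
    "NZ_CP072122.1\tProtein Homology\tCDS\t27378\t28724\t.\t+\t0\t"
    "ID=cds-WP_045544631.1;Parent=gene-J5P21_RS00140;"
    "Ontology_term=GO:0009252,GO:0008764;locus_tag=J5P21_RS00140;"
    "protein_id=WP_045544631.1\n"
    "NZ_CP072122.1\tProtein Homology\tCDS\t26966\t27217\t.\t+\t0\t"
    "ID=cds-WP_000942501.1;Parent=gene-J5P21_RS00135;"
    "locus_tag=J5P21_RS00135;protein_id=WP_000942501.1\n"
)

# Hypothetical list of differentially expressed genes (locus tags).
genes_of_interest = {"J5P21_RS00140"}

rows = [
    row for row in extract_cds_annotations(sample.splitlines())
    if row[0] in genes_of_interest
]

# Print one tab-separated line per gene: locus tag, protein ID, GO IDs.
for locus_tag, protein_id, go_ids in rows:
    print(locus_tag, protein_id, ",".join(go_ids), sep="\t")
```

For the real annotation you would pass an open file handle instead of `sample.splitlines()`. Genes without an `Ontology_term` attribute come back with an empty GO list, so you can decide later whether to keep them in the background set for enrichment.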