Question

Enrichment analysis in the plant proteins

0

6 months ago

Vedikaa Dhiman • 0

I am interested in performing enrichment analysis on disease-resistance plant proteins. I have all the protein IDs and sequences, extracted from the RiceRelativesGD V4.0 database. However, I think the database has replaced the FASTA headers with its own protein IDs, which is causing problems with the following:

1. In silico gene expression analysis
2. Phylogeny analysis
3. Construction of a protein-protein network in the STRING database

Specifically: (1) gene expression analysis requires gene IDs, which the Gene Expression Atlas can normally fetch automatically, but because I only have these protein IDs and some of the species are not in the Atlas, I cannot run the analysis. (2) Phylogeny analysis requires a multiple sequence alignment, which I cannot build because BLASTP does not return any major hits. (3) The STRING database requires a species name, which I cannot find for my species. Kindly suggest other tools or approaches for enrichment analysis on such proteins. The IDs and their respective species are listed below:

| Protein ID | Species |
|---|---|
| scaffold41.166 | Echinochloa crus-galli |
| OGLUM01G32470 | Oryza glumaepatula |
| ONIVA01G41900 | Oryza nivara |
| OPUNC01G35650 | Oryza punctata |
| scaffold76.301 | Echinochloa crus-galli |
| LPERR12G17120 | Leersia perrieri |
| ORUFI01G40100 | Oryza rufipogon |
| OMERI01G33870 | Oryza meridionalis |
| OPUNC12G16300 | Oryza punctata |
| OsR498G0203451400.01 | Oryza sativa subsp. indica |
| scaffold29.669 | Echinochloa crus-galli |
| scaffold15.384 | Echinochloa crus-galli |
| Zlat_10044396 | Zizania latifolia |
| Zlat_10027152 | Zizania latifolia |
| ONIVA12G18960 | Oryza nivara |
| scaffold25.271 | Echinochloa crus-galli |
| Chr2_RaGOO.1323.mRNA1 | Oryza sativa f. spontanea (indica group) |
| Chr1_RaGOO.4384.mRNA1 | Oryza sativa f. spontanea (indica group) |
| OB06G28290 | Oryza brachyantha |
| Zlat_10021111 | Zizania latifolia |

gene-ontology • 619 views

ADD COMMENT • link updated 6 months ago by Ram 43k • written 6 months ago by Vedikaa Dhiman • 0

score 0 · Answer 1 · 2023-10-30

0

6 months ago

dthorbur ★ 1.9k

It's not clear to me where the discrepancy in your dataset is coming from. Where is the protein ID clash coming from?

Regardless, there are a number of tools you can use to get around the first 2 problems.

1. Many gene expression tools count reads mapped to a transcriptome that you provide as input. kallisto is a good example, and you can pair it with DESeq2 for downstream analysis.
2. This depends on what kind of phylogeny you want to produce; your intended outcome is not clear from the question. It could be a species phylogeny, or it could be a series of gene trees or orthologous sequences. OrthoFinder may be of interest: it identifies orthologous and paralogous sequences among assemblies and outputs gene trees as well as a species tree. If you want a more parametrizable tool, I would recommend RAxML.
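
As a rough sketch of preparing input for OrthoFinder, which expects a directory containing one protein FASTA file per species, the snippet below splits a combined FASTA by species. The function names, the `ID_TO_SPECIES` table (only three of the IDs from the question, for brevity) and the example sequences are assumptions for illustration; the sequences are made-up placeholders, not real proteins.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(text):
    """Yield (protein_id, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_species(fasta_text, id_to_species, out_dir):
    """Write one FASTA per species into out_dir; return IDs without a species."""
    grouped = defaultdict(list)
    unassigned = []
    for protein_id, seq in read_fasta(fasta_text):
        species = id_to_species.get(protein_id)
        if species is None:
            unassigned.append(protein_id)
        else:
            grouped[species].append((protein_id, seq))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for species, records in grouped.items():
        path = out_dir / (species.replace(" ", "_") + ".fa")
        with path.open("w") as handle:
            for protein_id, seq in records:
                handle.write(f">{protein_id}\n{seq}\n")
    return unassigned


# Hypothetical mapping built from the ID/species list in the question.
ID_TO_SPECIES = {
    "OGLUM01G32470": "Oryza glumaepatula",
    "ONIVA01G41900": "Oryza nivara",
    "ONIVA12G18960": "Oryza nivara",
}

# Placeholder sequences for illustration only (not real proteins).
EXAMPLE_FASTA = """>OGLUM01G32470 some description
MKTAYIAK
QRQISFVK
>ONIVA01G41900
MSTNPKPQ
>ONIVA12G18960
MAAAKLLV
>unknown_protein
MXXX
"""
```

Calling `split_by_species(EXAMPLE_FASTA, ID_TO_SPECIES, "proteomes")` would write one file per species and return the IDs it could not assign. The resulting directory could then be passed to OrthoFinder with `orthofinder -f proteomes`. Note that OrthoFinder is designed for whole proteomes, so in practice you would split the full protein sets rather than a handful of sequences.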

I don't have any recommendations for the final point.

ADD COMMENT • link 6 months ago by dthorbur ★ 1.9k

0

The discrepancy is that the protein IDs I extracted from the RiceRelativesGD V4.0 database do not match those in public databases such as NCBI or UniProtKB. Because of this, I am unable to perform the in silico gene expression analysis, the phylogeny analysis, or the construction of a protein-protein network in the STRING database. How can I retrieve the correct IDs for these proteins? Or is there another way to do this?

Thank you for your suggestions in points 1 and 2.

ADD REPLY • link 6 months ago by Vedikaa Dhiman • 0

1

Discrepancies between databases are always annoying to deal with, in my experience. They are especially difficult when there are many different assemblies or isolates, as you'd expect with agronomically important species (which I guess you are working with). The discrepancies can range from different names for invariant genes to whole sets of expanded genes found in only one assembly.

You could still take a few assemblies and identify orthologs by clustering, but this is imperfect. I tend to stick to the best assembly I can find, and then for downstream analysis I find orthologs in other assemblies if necessary.
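
One simple way to look for orthologs between a chosen assembly and another one is a reciprocal best hit search. This is a swapped-in technique for illustration, not necessarily what I use or what OrthoFinder does internally. The sketch below assumes you have already run BLASTP in both directions with tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject and column 12 the bit score); the function names and the example rows are made up.

```python
def best_hits(blast_rows):
    """Map each query to its highest-bitscore subject from BLAST tabular rows."""
    best = {}
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_rows, b_vs_a_rows):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b_rows)
    backward = best_hits(b_vs_a_rows)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up BLAST tabular rows (12 columns) for two hypothetical assemblies A and B.
A_VS_B = [
    "a1\tb1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380",
    "a1\tb2\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-80\t320",
    "a2\tb2\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-70\t300",
]
B_VS_A = [
    "b1\ta1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t380",
    "b2\ta1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-80\t320",
    "b2\ta2\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-70\t300",
]

print(reciprocal_best_hits(A_VS_B, B_VS_A))
```

In this toy example only `a1` and `b1` are each other's best hit; `a2` points to `b2`, but `b2` prefers `a1`, so that pair is dropped. This is a crude filter and it will miss orthologs in expanded gene families.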

Not all proteins in one assembly will be present in another, and some may have expanded in a different assembly. I think a tool like OrthoFinder is still a good starting point. You could even limit downstream analyses to 1-1 orthologs with high sequence similarity to increase confidence.
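
A minimal sketch of the 1-1 filter, assuming an orthogroup table laid out like OrthoFinder's `Orthogroups.tsv`: tab-separated, an `Orthogroup` column followed by one column per species, with comma-separated gene IDs in each cell. The function name, species names and gene IDs below are hypothetical, and the sequence-similarity check is not covered here.

```python
import csv
import io


def single_copy_orthogroups(tsv_text):
    """Return {orthogroup: {species: gene}} for groups with exactly one gene per species."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    species_cols = [col for col in reader.fieldnames if col != "Orthogroup"]
    result = {}
    for row in reader:
        # Split each cell into a list of gene IDs; empty cells give empty lists.
        genes = {
            sp: [g.strip() for g in (row[sp] or "").split(",") if g.strip()]
            for sp in species_cols
        }
        if all(len(ids) == 1 for ids in genes.values()):
            result[row["Orthogroup"]] = {sp: ids[0] for sp, ids in genes.items()}
    return result


# Made-up table: only OG0000001 has exactly one gene in every species.
EXAMPLE_TSV = (
    "Orthogroup\tspecies_A\tspecies_B\n"
    "OG0000001\tA_g1\tB_g1\n"
    "OG0000002\tA_g2, A_g3\tB_g2\n"
    "OG0000003\tA_g4\t\n"
)

print(single_copy_orthogroups(EXAMPLE_TSV))
```

OrthoFinder also reports single-copy orthogroups in its own output, which may be simpler to use directly; the point of the sketch is only to show what the filter does.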

And as suggested, the phylogenetics can be conducted without the STRING database. I have no experience with said database, so I can't say what you'd be missing by not using it.

ADD REPLY • link 6 months ago by dthorbur ★ 1.9k

0

Thank you for the suggestion. I am trying out your suggestions.

ADD REPLY • link 6 months ago by Vedikaa Dhiman • 0

0

Are the tools you mentioned (kallisto, DESeq2 and OrthoFinder) applicable to plants?

ADD REPLY • link 6 months ago by Vedikaa Dhiman • 0

0

Why wouldn't they be? Ploidy level is the only additional consideration I can think of, but as long as you are aware of how this may affect your results, it should be the same as any other dataset.

ADD REPLY • link 6 months ago by dthorbur ★ 1.9k