Question

trying to do some GSEA in python

0

Entering edit mode

4 weeks ago

aleksk779 • 0

Hello, I'm having some major chaos with the output table generated and then passing that to my plot function. New to this (1st time), could somebody review this piece of code and suggest some corrections. I saved the output of a data frame as a txt file and it does not look good at all. The numbers and columns are there, but they follow some sort of strange wa, like a staircase.

# Create DataFrame from significant results
df_go = pd.DataFrame(
    [(
        r.GO,
        r.goterm.name,
        r.goterm.namespace,
        r.p_uncorrected,
        len(r.study_items),  # Number of study genes
        GO_items.count(r.GO),
        ', '.join([rev_mapper[y] for y in r.study_items])  # Concatenate study genes into a single string
    ) for r in goea_results_sig],
    columns=['GO', 'term', 'p', 'n_genes', 'n_study', 'n_go', 'study_genes']
)

print(df_go)
# Save DataFrame to a text file (CSV format)
df_go.to_csv('df_go_results.txt', sep='\t', index=False)

I can provide more code, because when I tried to commit and push from IDE onto github I got the following message.

Successfully created project 'GSEAplots' on GitHub, but initial commit failed: inflate: data stream error (incorrect data check) corrupt loose object 'a87848b96903648398f723df040fd0be540a025d' unable to read a87848b96903648398f723df040fd0be540a025d

No idea why or what ? Git does not seem to interact with PyCharm and/ or Github really well ?

The plots code :

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,4),dpi=300)
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.rcParams["font.size"] = 18
plt.rcParams["font.family"] = "Helvetica"


plt.scatter(df_p['Z'],-np.log10(df_p['P(sgRNAs)']),s=18,color='grey',alpha=0.3,linewidths=0.3,edgecolors='grey')


label_ls1 = list(df_up['Gene'])

df_show1 = df_p.loc[label_ls1,:]
lfc_ls1 = list(df_show1['Z']); p_ls1 = list(-np.log10(df_show1['P(sgRNAs)']))

plt.scatter(lfc_ls1,p_ls1,s=20,color='red',linewidths=0.5,edgecolors='grey')

for i, txt in enumerate(label_ls1[-4:]):
    plt.annotate(txt, (lfc_ls1[-4:][i]+0.03, p_ls1[-4:][i]-0.08),fontsize=8)

label_ls2 = list(df_down['Gene'])

df_show2 = df_p.loc[label_ls2,:]
lfc_ls2 = list(df_show2['Z']); p_ls2 = list(-np.log10(df_show2['P(sgRNAs)']))

plt.scatter(lfc_ls2,p_ls2,s=20,color='blue',linewidths=0.5,edgecolors='grey')

for i, txt in enumerate(label_ls2[1:5]):
    plt.annotate(txt, (lfc_ls2[1:5][i]+0.03, p_ls2[1:5][i]-0.08),fontsize=8)

plt.axhline(y=1.3, linestyle='--', color='lightgray') 
plt.axvline(x=-2, linestyle='--', color='lightgray') 
plt.axvline(x=2, linestyle='--', color='lightgray') 

plt.xlabel('Differential score (PRMT5/DMSO)')
plt.ylabel('-log10(p-value)')


#plt.savefig('DifferentialEssential_volcanoplot.png',dpi=300,facecolor='white')
plt.show()

Data frame that I am trying to plot has the following structure :

             total_samples  occurrence_total  occurrence_CRSsNP  \

ENSG00000100592 15 15 8
ENSG00000101336 15 14 8
ENSG00000134285 15 15 8

             occurrence_CTL  sparsity  rlog_av_CTL  rlog_av_CRSsNP  \

ENSG00000100592 7 0.099193 11.709476 11.413235
ENSG00000101336 6 0.322330 4.084709 6.179892
ENSG00000134285 7 0.098323 9.169283 9.595925

            rlog_Regulation  vst_av_CTL  vst_av_CRSsNP vst_Regulation  \

ENSG00000100592 CTL 11.790292 11.367062 CTL
ENSG00000101336 CRSsNP 4.909465 7.051789 CRSsNP
ENSG00000134285 CRSsNP 9.142557 9.734598 CRSsNP

              raw_av_CTL  raw_av_CRSsNP Raw_Regulation structural_zero  \

ENSG00000100592 3973.285714 2428.500 CTL No
ENSG00000101336 10.285714 119.750 CRSsNP No
ENSG00000134285 602.857143 761.625 CRSsNP No

             DESeq2Norm_av_CTL  DESeq2Norm_av_CRSsNP DESeq_Regulation  \

ENSG00000100592 3526.244455 2614.486202 CTL
ENSG00000101336 9.293777 131.796045 CRSsNP
ENSG00000134285 535.929110 828.762618 CRSsNP

                baseMean  log2FoldChange     lfcSE      stat  \

ENSG00000100592 3039.973387 -0.431423 0.076683 -5.626030
ENSG00000101336 74.628320 3.828859 0.605862 6.319689
ENSG00000134285 692.106981 0.628415 0.105894 5.934376

             pvalue_DESEq2  padj_DESEq2        pvalue      padj  \

ENSG00000100592 1.844041e-08 0.000120 3.694635e-06 0.038127
ENSG00000101336 2.620902e-10 0.000005 2.009572e-07 0.006221
ENSG00000134285 2.949646e-09 0.000029 1.052822e-06 0.016297

             errorBarUp  errorBarLow Final_Regulation external_gene_name

ENSG00000100592 -0.354739 -0.508106 CTL DAAM1
ENSG00000101336 4.434721 3.222997 CRSsNP HCK
ENSG00000134285 0.734309 0.522521 CRSsNP FKBP11

I tried to push onto Github - unknown error. Also, the data frame would not plot because of discrepancy in the structure of the data frames or the class which gets the df passed onto. ?? Any suggestions, the p-value filters can be adjusted. What I am trying to plot or get at is in bold. Much thanks for your ideas.

**Runing BP Ontology Analysis: current study set of 3 IDs.
100%      3 of      3 study items found in association
100%      3 of      3 study items found in population(20608)
Calculating 12,180 uncorrected p-values using fisher_scipy_stats
  12,180 terms are associated with 17,024 of 20,608 population items
      33 terms are associated with      3 of      3 study items**
  METHOD fdr_bh:
       0 GO terms found significant (< 0.1=alpha) (  0 enriched +   0 purified): statsmodels fdr_bh
       0 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

**Runing CC Ontology Analysis: current study set of 3 IDs.
100%      3 of      3 study items found in association
100%      3 of      3 study items found in population(20608)
Calculating 1,801 uncorrected p-values using fisher_scipy_stats
   1,801 terms are associated with 18,139 of 20,608 population items
      18 terms are associated with      3 of      3 study items**
  METHOD fdr_bh:
       0 GO terms found significant (< 0.1=alpha) (  0 enriched +   0 purified): statsmodels fdr_bh
       0 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

**Runing MF Ontology Analysis: current study set of 3 IDs.
100%      3 of      3 study items found in association
100%      3 of      3 study items found in population(20608)
Calculating 4,602 uncorrected p-values using fisher_scipy_stats
   4,602 terms are associated with 17,598 of 20,608 population items
      11 terms are associated with      3 of      3 study items**
  METHOD fdr_bh:
       0 GO terms found significant (< 0.1=alpha) (  0 enriched +   0 purified): statsmodels fdr_bh
       0 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

python GSEA GO • 277 views

ADD COMMENT • link updated 28 days ago by Wayne ★ 2.0k • written 4 weeks ago by aleksk779 • 0

0

Entering edit mode

What are you showing the output of there? The result of df_go.to_csv('df_go_results.txt', sep='\t', index=False) or the result of print(df_go). You'd need to provide more to go on just starting there. The stuff about git doesn't belong here. That is a different question for a different forum. (Indeed you can just use the browser interface with Github if you really cannot push from the command line/IDE. Or make a gist.) You just need to make a toy example that read in data to a dataframe and results in the same issue as you are seeing. We don't need all your data or actual data. Please learn about sharing fenced code block in forums such as this. One example is here. (Ideally specify the language, too so that it has syntax highlighting, see here.) Right now your example code that begins import seaborn as sns is unreadable and unuseable. Here is an example of you can share what is stored by something like df_go.to_csv('df_go_results.txt', sep='\t', index=False) so we can use it. (That example is written in regards to a Jupyter notebook but is valid Python code, too.) Ideally, you'd already have the steps to go from the toy set of data spelled out as part of a minimal MRE that results in what you are seeing as your issue.

ADD REPLY • link 4 weeks ago by Wayne ★ 2.0k

0

Entering edit mode

It seems you updated this some more yesterday. The code formatting is improved. Issues remain though with your post that make it hard to offer any guidance:

You don't give any indication of where the output that starts **Runing BP Ontology Analysis: current study set of 3 IDs is coming from? What was the command that resulting in that?
However, I cannot see how what what you share under "Data frame that I am trying to plot has the following structure :" relates to df_go. Is that the output of df_go.to_csv('df_go_results.txt', sep='\t', index=False) or print(df_go)?
However, I cannot see how plt.scatter(df_p['Z'],-np.log10(df_p['P(sgRNAs)']),s=18,color='grey',alpha=0.3,linewidths=0.3,edgecolors='grey' in your plotting code relates to what you share under "Data frame that I am trying to plot has the following structure :". Or how it relates to other parts of your post?

ADD REPLY • link 28 days ago by Wayne ★ 2.0k