Why are frequencies in goProfiles not the same with a sliced dataframe?
10.0 years ago
arnome • 0

I've built a specific DataFrame with Python pandas to compute ontology frequencies with goProfiles in Bioconductor. I use the basicProfile function with the option idType='GOTermsFrame', but without the optional column 'Evidence'. I have one big dataframe, as follows:

In [1]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119626 entries, 0 to 119625
Data columns (total 3 columns):
GeneID      119626 non-null object
GOID        119626 non-null object
Ontology    119626 non-null object
dtypes: object(3)

So there are almost 120000 entries, split by Ontology as follows:

In [2]: df.groupby(['Ontology'])['Ontology'].count()
Ontology
BP          58802
CC          26867
MF          33957

When I compute a goProfile for all three ontologies (onto="ANY") at level 2, I get these frequencies:

In [3]: rdf = com.convert_to_r_dataframe(df)
In [4]: %%R -i rdf
> library(goProfiles)
> rdf <- as.data.frame(rdf)
> print(head(rdf))
                GeneID       GOID Ontology
0 VIT_201s0011g00010.1 GO:0043565       MF
1 VIT_201s0011g00010.1 GO:0003964       MF
2 VIT_201s0011g00010.1 GO:0006278       BP
3 VIT_201s0011g00010.1 GO:0006367       BP
4 VIT_201s0011g00010.1 GO:0003743       MF
5 VIT_201s0011g00010.1 GO:0005840       CC

> profiles.ANY <- basicProfile(rdf,idType='GOTermsFrame',onto="ANY",level=2)
> printProfiles(profiles.ANY,percentage=T,aTitle="Test GO Profile")

Test GO Profile
========================
[1] "MF ontology"
                    Description       GOID Frequency
12         antioxidant activity GO:0016209       1.0
9                       binding GO:0005488      75.0
4            catalytic activity GO:0003824      65.1
1  electron carrier activity... GO:0009055       3.5
15 enzyme regulator activity... GO:0030234       1.6
21 molecular transducer acti... GO:0060089       3.1
3  nucleic acid binding tran... GO:0001071       2.8
6  nutrient reservoir activi... GO:0045735       0.5
2  protein binding transcrip... GO:0000988       0.1
5             receptor activity GO:0004872       1.2
7  structural molecule activ... GO:0005198       2.8
8          transporter activity GO:0005215       8.2
[1] "BP ontology"
[1] Description GOID        Frequency
<0 rows> (or 0-length row.names)
[1] "CC ontology"
[1] Description GOID        Frequency
<0 rows> (or 0-length row.names)

So neither the BP nor the CC ontology shows up.

But when I take a slice of the first 500 rows of this big dataframe and compute the profile the same way (any ontology, level=2), I get this:

In [5]: dft = df[0:500]
In [6]: rdft = com.convert_to_r_dataframe(dft)
In [7]: %%R -i rdft
> profs.ANY <- basicProfile(rdft,idType='GOTermsFrame',onto="ANY",level=2)
> printProfiles(profs.ANY,percentage=T,aTitle="Test GO Profile")
Test Profile
============
[1] "MF ontology"
                   Description       GOID Frequency
9                      binding GO:0005488      77.8
4           catalytic activity GO:0003824      49.2
1 electron carrier activity... GO:0009055       3.2
3 nucleic acid binding tran... GO:0001071       1.6
7 structural molecule activ... GO:0005198       1.6
8         transporter activity GO:0005215      12.7
[1] "BP ontology"
[1] Description GOID        Frequency
<0 rows> (or 0-length row.names)
[1] "CC ontology"
                  Description       GOID Frequency
3                        cell GO:0005623      93.4
6               cell junction GO:0030054       3.3
17                  cell part GO:0044464      93.4
2        extracellular region GO:0005576       8.2
9   macromolecular complex... GO:0032991      21.3
1                    membrane GO:0016020      34.4
8  membrane-enclosed lumen... GO:0031974       3.3
15              membrane part GO:0044425      19.7
4                    nucleoid GO:0009295       1.6
10                  organelle GO:0043226      75.4
13             organelle part GO:0044422      21.3
19                   symplast GO:0055044       3.3

It's really difficult to understand why:

  • there are no BP ontology frequencies in either case, even though there are 58802 rows annotated with the BP ontology in the main frame and 234 in the short one
  • there are CC ontology frequencies for the short frame but none at all for the main frame, even though the short frame is just the (small) first part of the big one.

Can the level (2 in this case) explain these major differences? Or am I making a mistake somewhere?
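
One way to narrow this down might be to compare what each frame actually contains before it is handed to R. The sketch below is only a diagnostic under stated assumptions: it tallies rows per ontology after stripping whitespace and upper-casing the label, and lists any raw labels other than BP, CC and MF. Stray or unexpected labels are just a guess at a possible cause, not a confirmed one, and the helper name `ontology_report` is made up for illustration.

```python
from collections import Counter

EXPECTED = {"BP", "CC", "MF"}

def ontology_report(rows):
    """Return (row counts per cleaned ontology label, unexpected raw labels)."""
    raw = Counter(ontology for _gene_id, _go_id, ontology in rows)
    cleaned = Counter()
    for label, n in raw.items():
        cleaned[str(label).strip().upper()] += n
    unexpected = sorted(str(label) for label in raw if label not in EXPECTED)
    return cleaned, unexpected

# Hypothetical rows; with pandas, use df.itertuples(index=False)
# for the full frame and df[0:500].itertuples(index=False) for the slice.
demo_rows = [
    ("gene_1", "GO:0003824", "MF"),
    ("gene_1", "GO:0005488", "MF "),  # trailing space, flagged as unexpected
    ("gene_2", "GO:0006278", "BP"),
]

counts, odd_labels = ontology_report(demo_rows)
print(counts, odd_labels)
```

Running the same report on the full frame and on the slice would show whether both contain clean BP, CC and MF labels in the proportions expected.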

Thanks a lot,

arnome.

R bioconductor • 2.4k views

It's an old topic, but did you find a solution? I have exactly the same problem.

6.9 years ago
arnome • 0

No, sorry, I didn't find a solution.

arnome.
