Question

Stuck on calculating principal components

0

Entering edit mode

6.4 years ago

zizigolu ★ 4.3k

Hi,

I have a table of 16SrRNA counts, rowa=samples columns=bacteria

> head(otus[,1:5])

      Class_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
    IPC001                 NA                 NA              NA              NA
    IPC002           3.17e-05           0.00e+00           0e+00               0
    IPC003           0.00e+00           0.00e+00           2e-05                0
    IPC005           0.00e+00           1.31e-05           0e+00                0
    IPC007                 NA                 NA              NA              NA
    IPC008           0.00e+00           1.45e-05           0e+00                0
    IPC001                   NA
    IPC002          0.006243463
    IPC003          0.000519667
    IPC005          0.001062978
    IPC007                   NA
    IPC008          0.000607156

I am going to extract principal components (PC) but whatever I tried to calculate

  # get Bray-Curtis distances (default for Vegan)
    d.bray <- vegdist(otus)

saying that

Error in base::colMeans(x, na.rm = na.rm, dims = dims, ...) : 'x' must be numeric

I tried to removed NA by

otus[,!is.na(otus$phenotype)] or otus[!is.na(otus$phenotype),]

but everything will be removed

what is your suggestion please?

Vegan R software error SNP • 6.7k views

ADD COMMENT • link updated 6.4 years ago by pfs ▴ 280 • written 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

d.euc <- dist(otu)
pc.euc <- cmdscale(d.euc, k=2)

head(pc.euc )
             [,1]        [,2]
IPC001  0.2818391  0.43399837
IPC002 -0.5019871  0.19735612
IPC003 -1.2032946  0.11230292
IPC005 -1.2037247 -1.61211143
IPC007  1.1408847  0.09297364
IPC008 -1.4736239  0.01609752

Says that it is principal coordinates

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

score 3 · Answer 1 · 2017-11-27

3

Entering edit mode

6.4 years ago

Kevin Blighe 87k

This function, vegdist, calculates a distance matrix and does not perform PCA (?). Which did you want to perform?

It's possible to get this to run using the following:

vegdist(data.matrix(t(otus)), method="euclidean", na.rm=TRUE)
                   lass_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
Class_Anaerolineae     0.00003723909                                   
Class_Chlorobia        0.00003748186      0.00002796176                
Class_Deinococci       0.00003170000      0.00001954124   0.00002000000
unknown                0.00635252816      0.00637999757   0.00638193215 0.00638352917

[your fifth bacterium had no column name, so I set it to 'unknown']

[your fourth bacterium had no data, so, it was removed by the addition of *na.rm=TRUE to vegdist]*

With the default Bray-Curtis method, it produces a warning:

vegdist(data.matrix(t(otus)), method="bray", na.rm=TRUE)
                   lass_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
Class_Anaerolineae         1.0000000                                   
Class_Chlorobia            1.0000000          1.0000000                
Class_Deinococci           1.0000000          1.0000000       1.0000000
unknown                    0.9925103          0.9934758       0.9952681 1.0000000
Warning message:
In vegdist(data.matrix(t(otus)), method = "bray", na.rm = TRUE) :
  you have empty rows: their dissimilarities may be meaningless in method “bray”

-----------------------

You can also convert NAs to zeros, provide you understand exactly why values were assigned NA in the first place.

otus[is.na(otus)] <- 0
vegdist(data.matrix(t(otus)), method="euclidean", na.rm=TRUE)
                   lass_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
Class_Anaerolineae     0.00003723909                                   
Class_Chlorobia        0.00003748186      0.00002796176                
Class_Deinococci       0.00003170000      0.00001954124   0.00002000000
unknown                0.00635252816      0.00637999757   0.00638193215 0.00638352917

---------------------------

hc1 <- hclust(as.dist(vegdist(data.matrix(t(otus)), method="euclidean", na.rm=TRUE)), method="ward.D2")
hc2 <- hclust(as.dist(vegdist(data.matrix(otus), method="euclidean", na.rm=TRUE)), method="ward.D2")

par(mfrow=c(1,2))
plot(hc1); plot(hc2)

If you want a circular dendrogram, see A: how to draw circular dendrogram with distance information

ADD COMMENT • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks a lot, actually all of my questions are related. I have 16SrRNA counts of 4914 bacterial taxa for 323 individuals. I have SNPs (bim/fam/bed) for these people too. I am trying to find SNPs-bacterial associations. GEMMA and microbiome GWAS are Linux based then I am trying R codes in windows. By this link http://metagenome.cs.umn.edu/mice8992/repo/data/guerreronegro/ordination.html

I was trying to calculate principal components when I got error. Then I was trying SNPs filtering and GWAS on bacteria-SNPs by this link

http://www.stat-gen.org/str/Mod3_Lab1_GWAA_typedSNPs.html

where you wrote a code for me to model my 4914 variables (16SrRNA counts for 4914 bacterial taxa).

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

I am trying to find SNPs-bacterial associations

I am still trying to imagine why there would be logical associations between SNP's people have and their microbiomes. Where exactly is the microbiome data from (stool or some other source)? Were these individuals consuming a controlled diet/were in controlled environment? Is this data a one-time snap shot or are there multiple time points for same individuals over time?

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

Thank you. Actually they did not tell me the logic behind their experiments, only the 16SrRNA counts for 323 individuals in addition to their genotypes. They asked to find if there is SNPs related to microbiota. I am trying to find SNPs associated with the higher principals components of bacterial taxa by a linear model.

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Actually I need to extract

principal components

, fourth bacterium has data just has been removed in copy paste just table has NA

otus=otus[is.na(otus)] <- 0
> vegdist(data.matrix(t(otus)), method="bray", na.rm=TRUE)
dist(0)

Warning message: In vegdist(data.matrix(t(otus)), method = "bray", na.rm = TRUE) : you have empty rows: their dissimilarities may be meaningless in method “bray”

https://www.imageupload.co.uk/image/EAjD

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

The line:

otus=otus[is.na(otus)] <- 0

This should just be:

otus[is.na(otus)] <- 0

You use Windows?

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you, yes I am in windows

by replacing NA with zero

otus[is.na(otus)] <- 0
> View(otus)
> vegdist(data.matrix(t(otus)), method="bray", na.rm=TRUE)
Error in rowSums(x, na.rm = TRUE) : 'x' must be numeric

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Ensure that there are just numerical values in otus. You may have empty cells or some other text denoting missing values, such as 'ND'.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

detect<-apply(OTUS, 2, function(x){as.numeric(as.character(x))})
vegdist(data.matrix(t(detect)), method="bray", na.rm=TRUE)
pc.euc <- cmdscale(cc)

NOW,

str(pca.otus) List of 7 $

 sdev    : Named num [1:4914] 3.75 2.6 1.41 1.14 0.91 ...
  ..- attr(*, "names")= chr [1:4914] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ loadings: loadings [1:4914, 1:4914] -0.0157 -0.0222 -0.0131 -0.0101 0.0118 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4914] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ center  : Named num [1:4914] 0.928 0.898 0.935 0.95 0.938 ...
  ..- attr(*, "names")= chr [1:4914] "Class_Thermobacula" "Class_Anaerolineae" "Class_Chlorobia" "Class_Deinococci" ...
 $ scale   : Named num [1:4914] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:4914] "Class_Thermobacula" "Class_Anaerolineae" "Class_Chlorobia" "Class_Deinococci" ...
 $ n.obs   : int 4914
 $ scores  : num [1:4914, 1:4914] 3.15 5.29 2.33 1.36 -4.17 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4914] "Class_Thermobacula" "Class_Anaerolineae" "Class_Chlorobia" "Class_Deinococci" ...
  .. ..$ : chr [1:4914] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
 $ call    : language princomp(x = cc)
 - attr(*, "class")= chr "princomp"

I should try to figure out how write class of PCs in dataframe....

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

I also used cmdscale but it returned a standard list of eigenvectors for me. I was then easily able to plot these in a bi-plot. I'm not sure why your cmdscale function produces all of this other information? I used cmdscale from MASS package.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

From the sample of data supplied it appears that subjects have NA values for all columns or no input at all. Since these are failed subjects I would remove these subjects instead of converting NA's to 0.

ADD REPLY • link 6.4 years ago by pfs ▴ 280

0

Entering edit mode

In this situation, where a sample has all NAs or all zeros, the program appears to ignore them anyway:

vegdist(data.matrix(t(otus)), method="euclidean", na.rm=TRUE)
                       lass_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
Class_Anaerolineae     0.00003723909                                   
Class_Chlorobia        0.00003748186      0.00002796176                
Class_Deinococci       0.00003170000      0.00001954124     0.00002000000
unknown                0.00635252816      0.00637999757     0.00638193215   0.00638352917

otus[is.na(otus)] <- 0
vegdist(data.matrix(t(otus)), method="euclidean", na.rm=TRUE)
                       class_Thermobacula Class_Anaerolineae Class_Chlorobia Class_Deinococci
Class_Anaerolineae     0.00003723909                                   
Class_Chlorobia        0.00003748186      0.00002796176                
Class_Deinococci       0.00003170000      0.00001954124      0.00002000000
unknown                0.00635252816      0.00637999757      0.00638193215   0.00638352917

In certain situations, it may of course be more appropriate to just remove the NAs, but it depends on how one classifies 'NA' in relation to one's method(s) of measurement. NA means different things in different contexts, and it's up to the OP to understand whether it's appropriate to convert to zero or throw out that information.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Euclidian distance was OK with NA but Bray not

I just adapted

Library(vegan)

   # get Euclidean distance
d.euc <- dist(otus)

# Run PCoA (not PCA)
pc.euc <- cmdscale(d.euc, k=2)

from this link http://metagenome.cs.umn.edu/mice8992/repo/data/guerreronegro/beta.diversity.html

likely gives me principal coordinates (PCoA). I read in this paper https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0759-1

that they used principal coordinates (PCoA) for correlation analysis with SNPs. but I just wonder why I have just two PCoAs not more?????

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

You only have 2 dimensions because you just specified k=2

k: the maximum dimension of the space which the data are to be represented in; must be in {1, 2, ..., n-1}.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you by K=4 I got four components, you are all right

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Sorry,

How can I show 1 million bases surrounding a gene that contains my SNPs??

This is my SNP file

head(data)
         SNP   p.value  Neg_logP chr  position  type gene
1  rs3754689 0.2348511 0.6292074   2 136590746 typed  LCT
2  rs9636213 0.2433076 0.6138443   2 136589278 typed  LCT
3 rs12988076 0.2823770 0.5491707   2 136569848 typed  LCT
4  rs2322660 0.2908767 0.5362911   2 136557319 typed  LCT
5  rs2322659 0.2973054 0.5267972   2 136555659 typed  LCT
6   rs745500 0.3250106 0.4881024   2 136583192 typed  LCT

I tried Manhatan plot but this not what I want, consider please

https://ibb.co/cDbncw

https://ibb.co/chfoHw

I also tried Regional association plots but this is my error

Reading GWAS datasets ...
PLINK format detected
Assembling regions...
Initializing biomart connection (SNPs)...
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Not sure what you want to do... you want to get all SNPs that are 1 megabase (1 million bases) surrounding LCT?

Coincidentally, LCT (specifically rs3754689, also in your data) came up in an exome sequencing study of which I was part when I was working in Italy:

Whole-exome sequencing to identify genetic risk variants underlying inhibitor development in severe hemophilia A patients

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Sounds to me like just show a million base interval around the SNP instead of full chromosome.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

Thank you, I think you right, I don't know how show that. by Regional association plots gives me this error

Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

I'm not sure which program you're aiming to use. You can subset the Manhattan plot to a specific region, though:

require(qqman)

manhattan <- data[,c("SNP", "chr", "position", "p.value")]

colnames(manhattan) <- c("SNP", "CHR", "BP", "P")

minBP <- (min(manhattan$BP) - 500000) / 1000000
maxBP <- (max(manhattan$BP) + 500000) / 1000000

manhattan(manhattan, main=bquote(italic(LCT)~locus), highlight=NULL, suggestiveline=-log10(0.0001), genomewideline=-log10(2.5E-07), col=c("springgreen4", "firebrick"), xlim=c(minBP,maxBP), annotatePval=0.05, annotateTop=TRUE)

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you so much, this is whole of my data

SNP p.value Neg_logP    chr position    type    gene
rs3754689   0.234851075 0.629207447 2   136590746   typed   LCT
rs9636213   0.24330764  0.613844253 2   136589278   typed   LCT
rs12988076  0.282377011 0.549170663 2   136569848   typed   LCT
rs2322660   0.290876656 0.536291131 2   136557319   typed   LCT
rs2322659   0.297305417 0.526797177 2   136555659   typed   LCT
rs745500    0.325010646 0.488102413 2   136583192   typed   LCT
rs3769008   0.34295286  0.464765571 2   136586183   typed   LCT
rs3769013   0.35305602  0.452156379 2   136556182   typed   LCT
rs2304371   0.414016688 0.382982153 2   136561557   typed   LCT
rs3213890   0.440833185 0.35572572  2   136552188   typed   LCT
rs746857    0.473643203 0.324548691 2   136554890   typed   LCT
rs3754690   0.626450686 0.203113111 2   136590558   typed   LCT
rs3739022   0.633182068 0.198471393 2   136562472   typed   LCT
rs16832006  0.638593182 0.194775722 2   136543147   typed   LCT
rs1807356   0.74475088  0.127988975 2   136555484   typed   LCT
rs16832011  0.750458296 0.124673437 2   136545299   typed   LCT
rs4954633   0.78956937  0.102609708 2   136546635   typed   LCT
rs16832067  0.795462221 0.099380442 2   136584879   typed   LCT

I run your code

manhattan(manhattan, main=bquote(italic(LCT)~locus), highlight=NULL,suggestiveline=-log10(0.0001), genomewideline=-log10(2.5E-07 ), col=c("springgreen4", "firebrick"), xlim=c(minBP,maxBP), annotatePval=0.05, annotateTop=TRUE)

said that

Error in log10(topSNPs$P) : non-numeric argument to mathematical

May you please consider

https://www.imageupload.co.uk/image/El79

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

The column names have to be named exactly as c("SNP", "CHR", "BP", "P")

Also, sometimes in R, numerical values are actually saved in the data-frame as factors (categorical variables). You can check for this with the is.numeric() function, for example:

is.numeric(manhattan$p.value)

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

From the data-frame that you have there, try this (it works):

require(qqman)
colnames(manhattan) <- c("SNP","P", "Neg_logP", "CHR", "BP", "type", "gene")
minBP <- (min(manhattan$BP) - 500000) / 1000000
maxBP <- (max(manhattan$BP) + 500000) / 1000000
manhattan(manhattan, main=bquote(italic(LCT)~locus), highlight=NULL, suggestiveline=-log10(0.0001), genomewideline=-log10(2.5E-07), col=c("springgreen4", "firebrick"), xlim=c(minBP,maxBP), annotatePval=0.05, annotateTop=TRUE)

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Sorry I plotted 18 SNPs within LCT like so

https://ibb.co/jxNRNw

but in X axis it does not seem if this 1 mb length around this gene

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Yes, for this example, I just used 500,000 bases up- and down-stream of your group of SNPs. I did this using the xlim parameter, passed to the manhattan function.

The GRCh37 / hg19 co-ordinates for LCT are: 135787840-135837200

[source: https://www.ncbi.nlm.nih.gov/gene/3938]

Therefore, your xlim to the manhattan function should be:

manhattan(..., xlim=c(134.787840, 136.837200), ...)

[1 megabase up- and downstream]

manhattan(..., xlim=c(135.287840, 136.337200), ...)

[500,000 bases up- and down-stream]

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you vey much for your help

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

when I tried to remove rows or columns with NAs all of the table was removed because NAs are seemingly everywhere.

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

I wonder if this is a trick assignment to test if you will question the logic/validity of this hypothesis.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

Hey genomax, what do you mean? - you mean her supervisors may be testing her? Does that happen?

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

She has been tirelessly working towards finding a position (I believe as a PhD student) abroad for some time. There have been similar assignments where we have provided some assistance to her in the past.

ADD REPLY • link 6.4 years ago by GenoMax 141k

0

Entering edit mode

You all right... two weeks ago I was assigned this task… Honestly, solving this problem solely does not matter for me anymore, what is suffering me is that if I want to stay in this field I should be able to find the solution. I no longer care if he accepts or rejects me, I just want to give myself some self-confidence 

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Sounds like a difficult situation! I cannot comment that much as I do not know everything. As you are early in your career, my advice would be to not give away too much personal info on places like this.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello, everything going fine today?

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello, thanks a lot

As I did not have the age of the genotyped individuals, I therefore ignored age and sex as covariates. I used filtered typed SNPs (call rate=0.99, MAF=0.01, both for samples and SNPS) and two principal coordinates of OTUs calculated by Euclidian distance by vegan package to model SNPs-microbial community linearly. Fortunately for all I found R code and I enabled to do everything without Linux or plink. Finally, 316 individuals stayed in the analysis. I have two doubts in my analysis, one: due to the too many missing values in OTUs I used Euclidian distance for getting principal coordinates and second: in linear modelling, SNPs are my predictors and I assumed principal coordinates as my quantitative variables (I assumed that I am doing a GWAS because tools like GEMMA, microbiome GWAS, etc that are used for this modelling were Linux-based. MiRKAT does this correlation but needs SNP data by 0/1/2, then it should be outcome (Y) and the OTUS from 16SrRNAs is in the X covariates. As I could not figure out how run MiRKAT I then went on by the codes provided in your link in parallel analysis. In sum, I noticed that there was not any SNPs related to principal coordinate2. However I sent my results to my supervisor.

Thank you once more

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

2

Entering edit mode

Okay, hope everything works out! You're not alone as long as Biostars exists!

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Yes, exactly

At least for me biostars is always helpful.

Thank you

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

What happens when you just remove rows with NA?

ADD REPLY • link 6.4 years ago by pfs ▴ 280

0

Entering edit mode

I tried both removing the rows and columns with NA separately, in both cases everything was removed.

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Sorry, if this is my data from bacteria

head((OTU[,1:4]))
                          IPC001 IPC002        IPC003        IPC005       
Class_Thermobacula        NA     "0.000031700" "0.000000000" "0.000000000"
Class_Anaerolineae        NA     "0.000000000" "0.000000000" "0.000013100"
Class_Chlorobia           NA     "0.000000000" "0.000020000" "0.000000000"
Class_Deinococci          NA     "0.000000000" "0.000000000" "0.000000000"
Class_Actinobacteria      NA     "0.006243463" "0.000519667" "0.001062978"
Class_Alphaproteobacteria NA     "0.002123411" "0.000679565" "0.000931746"



  > tail((OTU[,1:4]))
                                           IPC001        IPC002        IPC003       
    Species_Blautia_coccoides.8            "0.027717731" "0.000747908" "0.006174639"
    Species_Lachnospira_pectinoschiza.8    "0.000288226" "0.000514187" "0.004054591"
    Species_Bacteroides_dorei.8            "0.000144113" "0.008951526" "0.018735922"
    Species_Faecalibacterium_prausnitzii.8 "0.000096100" "0.011662693" "0.028223135"
    Species_Bacteroides_vulgatus.8         "0.000864678" "0.074253260" "0.117159136"
    Species_Prevotella_copri.8             "0.000144113" "0.084560370" "0.000026500"
                                           IPC005
    Species_Blautia_coccoides.8            NA    
    Species_Lachnospira_pectinoschiza.8    NA    
    Species_Bacteroides_dorei.8            NA    
    Species_Faecalibacterium_prausnitzii.8 NA    
    Species_Bacteroides_vulgatus.8         NA    
    Species_Prevotella_copri.8             NA    
    >

How I can collapsing this OTUs abundance by summing counts over rows that correspond to identical taxa (Class, Genus, etc) to reduce dimentiality????

http://uploads.im/skPIw.png

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

For that, you should take a look at the aggregate function in R.

--------------------------------

If I start with:

OTU
                                            IPC001      IPC002      IPC003
Class_Thermobacula                              NA 0.000031700 0.000000000
Class_Anaerolineae                              NA 0.000000000 0.000000000
Class_Chlorobia                                 NA 0.000000000 0.000020000
Class_Deinococci                                NA 0.000000000 0.000000000
Class_Actinobacteria                            NA 0.006243463 0.000519667
Class_Alphaproteobacteria                       NA 0.002123411 0.000679565
Species_Blautia_coccoides.8            0.027717731 0.000747908 0.006174639
Species_Lachnospira_pectinoschiza.8    0.000288226 0.000514187 0.004054591
Species_Bacteroides_dorei.8            0.000144113 0.008951526 0.018735922
Species_Faecalibacterium_prausnitzii.8 0.000096100 0.011662693 0.028223135
Species_Bacteroides_vulgatus.8         0.000864678 0.074253260 0.117159136
Species_Prevotella_copri.8             0.000144113 0.084560370 0.000026500
                                            IPC005
Class_Thermobacula                     0.000000000
Class_Anaerolineae                     0.000013100
Class_Chlorobia                        0.000000000
Class_Deinococci                       0.000000000
Class_Actinobacteria                   0.001062978
Class_Alphaproteobacteria              0.000931746
Species_Blautia_coccoides.8                     NA
Species_Lachnospira_pectinoschiza.8             NA
Species_Bacteroides_dorei.8                     NA
Species_Faecalibacterium_prausnitzii.8          NA
Species_Bacteroides_vulgatus.8                  NA
Species_Prevotella_copri.8                      NA

I need to create a new column that will instruct the aggregate function how the data should be summed. In this example, I create a column that just has the Class names of each bacterium:

OTU.sum <- data.frame(rownames(OTU), OTU)
OTU.sum[,1]
OTU.sum[,1]
 [1] Class_Thermobacula                    
 [2] Class_Anaerolineae                    
 [3] Class_Chlorobia                       
 [4] Class_Deinococci                      
 [5] Class_Actinobacteria                  
 [6] Class_Alphaproteobacteria             
 [7] Species_Blautia_coccoides.8           
 [8] Species_Lachnospira_pectinoschiza.8   
 [9] Species_Bacteroides_dorei.8           
[10] Species_Faecalibacterium_prausnitzii.8
[11] Species_Bacteroides_vulgatus.8        
[12] Species_Prevotella_copri.8

Tidy up the names with gsub to just get the Class name:

OTU.sum[,1] <- gsub("Class_|Species_", "", OTU.sum[,1])
OTU.sum[,1] <- gsub("\\.[0-9]*$", "", OTU.sum[,1])
OTU.sum[,1] <- gsub("_[a-zA-Z]*$", "", OTU.sum[,1])
OTU.sum[,1]
 [1] "Thermobacula"        "Anaerolineae"        "Chlorobia"          
 [4] "Deinococci"          "Actinobacteria"      "Alphaproteobacteria"
 [7] "Blautia"             "Lachnospira"         "Bacteroides"        
[10] "Faecalibacterium"    "Bacteroides"         "Prevotella"

Now, sum over the bacteria of the same class according to the column that we created:

OTU.sum <- aggregate(OTU.sum[2:ncol(OTU.sum)], OTU.sum[1], sum)
rownames(OTU.sum) <- OTU.sum[,1]
OTU.sum <- OTU.sum[,-1]
OTU.sum
                         IPC001      IPC002      IPC003      IPC005
Actinobacteria               NA 0.006243463 0.000519667 0.001062978
Alphaproteobacteria          NA 0.002123411 0.000679565 0.000931746
Anaerolineae                 NA 0.000000000 0.000000000 0.000013100
Bacteroides         0.001008791 0.083204786 0.135895058          NA
Blautia             0.027717731 0.000747908 0.006174639          NA
Chlorobia                    NA 0.000000000 0.000020000 0.000000000
Deinococci                   NA 0.000000000 0.000000000 0.000000000
Faecalibacterium    0.000096100 0.011662693 0.028223135          NA
Lachnospira         0.000288226 0.000514187 0.004054591          NA
Prevotella          0.000144113 0.084560370 0.000026500          NA
Thermobacula                 NA 0.000031700 0.000000000 0.000000000

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

> OTU.sum <- aggregate(OTU.sum[2:ncol(OTU.sum)], OTU.sum[1], sum)
Error in Summary.factor(2L, na.rm = FALSE) : 
  ‘sum’ not meaningful for factors
>

like so http://uploads.im/gekOU.jpg

Thank you, this data containing three locations "IL,TR, RE" & three amplicons "v1v2, v3v4, v5v6" & six phylogentic levels "Class, Family, Genus, Order, Phylum, Species" ), for example:

Class_Thermobacula for amplicon v1v2 location RE

Family_Thermobaculaceae  for amplicon v1v2 location RE

Order_Thermobaculales for amplicon v1v2 location RE

Class_Thermobacula for amplicon v1v2     location  IL

Family_Thermobaculaceae for amplicon v1v2 location  IL

Order_Thermobaculales    for amplicon v1v2 location  IL

Class_Thermobacula for amplicon v1v2 location TR

Family_Thermobaculaceae  for amplicon v1v2 location TR

Order_Thermobaculales for amplicon v1v2 location TR

Class_Thermobacula for amplicon v5v6 location TR

Family_Thermobaculaceae  for amplicon v5v6 location TR

Order_Thermobaculales for amplicon v5v6 location TR

However I am so sorry for disturbing you.

ADD REPLY • link 6.4 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Yes, I noticed that, in your data, your numerical values are stored as characters. This will mean that they will be interpreted as categorical variables ('factors'). You should first convert these to numerical with something like:

data.matrix(OTU)

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k