Question

How to append a data.frame in R with new data

8.3 years ago

hakimelakhrass ▴ 80

I am trying to run the following code on a list of IDs instead of a single ID:

source("https://bioconductor.org/biocLite.R")
#install.packages('reutils')
#install.packages('Peptides')
#biocLite(pkgs = c('GenomeInfoDb','GenomicRanges'))
#install.packages('plyr')
#install.packages('devtools')
#devtools::install_github("gschofl/biofiles")
library(Peptides)
library(reutils)
library(Biostrings)
library(biofiles)
library(plyr)
library(stringr)
library(tibble)
#install.packages('data.table')
library(data.table)

#this is exactly the end format of the data frame I want, but for a list of UIDs instead of a single UID like 124511
fetch <- efetch(124511, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
rec <- gbRecord(fetch)
seq <- getSequence((ft(rec)))
m <- as.data.frame(seq)
setnames(m, "x", "sequence")
protienName <- names(seq)
m <- add_column(m, protienName, .after = 0)
m$molecularweight <- mw(m$sequence)
m$m<- str_count(m$sequence, 'm')
m$cc <- str_count(m$sequence, 'cc')
logvec <- grepl('(Protein)|(Region)', m$protienName)
m <- subset(m, logvec)

The problem is that efetch() can only take one ID at a time, so I must either write a for loop or use an apply function over the list of protein IDs. If I ran the code as is inside a loop, each iteration would overwrite the result of the previous one. I was hoping someone could help me append to the data.frame on each iteration, or show me a way to keep each iteration from replacing the previous one.

R Genebank NCBI protiens • 3.4k views

updated 2.2 years ago by Ram 45k • written 8.3 years ago by hakimelakhrass ▴ 80

efetch() can take a list of IDs: https://www.rdocumentation.org/packages/reutils/versions/0.2.2/topics/efetch

8.3 years ago by Santosh Anand 5.8k

Wow... how about that. For some reason I really thought it couldn't! I will try just feeding it a list then. Thanks.

8.3 years ago by hakimelakhrass ▴ 80

score 2 · Answer 1 · 2017-03-27

You could have a function called fetch_id() that returns a record for a given id:

fetch_id <- function(id) { return(efetch(id, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)); }

Then, given a vector of IDs:

ids <- c(124511, 124512, 124513, ... )

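you can pre-allocate a list, though this is only needed if you fill it inside a for loop; lapply() returns a new list of the right length on its own: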

l <- vector(mode = "list", length = length(ids))

Then iterate over ids to populate the list l with fetched results:

l <- lapply(ids, fetch_id)

Once this list is populated, you can run lapply() over it to apply a function of your choice to each element:

process_fetched_record <- function(fr) {
    rec <- gbRecord(fr)
    seq <- getSequence((ft(rec)))
    m <- as.data.frame(seq)
    setnames(m, "x", "sequence")
    protienName <- names(seq)
    m <- add_column(m, protienName, .after = 0)
    m$molecularweight <- mw(m$sequence)
    m$m<- str_count(m$sequence, 'm')
    m$cc <- str_count(m$sequence, 'cc')
    logvec <- grepl('(Protein)|(Region)', m$protienName)
    m <- subset(m, logvec)
    return(m)
}

m <- lapply(l, process_fetched_record)

This gives you a list m of data frames, one per ID, that you can access by index (for example, m[[1]]).
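Since the question asks for a single data frame rather than a list, the per-ID data frames can be stacked in one step at the end. The following is only a minimal sketch in base R: process_id() is a hypothetical stand-in for fetch_id() plus process_fetched_record(), and the sequences are made up so that the example runs without contacting NCBI.

```r
# Hypothetical stand-in for fetch_id() + process_fetched_record():
# returns a small, made-up data frame per ID so the sketch runs offline.
process_id <- function(id) {
  data.frame(
    protienName = c("Protein", "Region"),
    sequence = c("MKVCC", "MMA"),
    stringsAsFactors = FALSE
  )
}

ids <- c(124511, 124512, 124513)

# One data frame per ID, named by ID so each can be traced later.
m <- lapply(ids, process_id)
names(m) <- ids

# Tag each data frame with its ID, then stack them all in a single call.
tagged <- Map(
  function(df, id) data.frame(id = id, df, stringsAsFactors = FALSE),
  m, names(m)
)
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

Binding once at the end keeps every iteration's rows and records which ID each row came from.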

Ram · Answer 2 · 2017-03-27

I'm not familiar with your code, and I made a few adjustments to get it to run, so I hope it still produces what you want.

You asked for a loop that iterates over your code without overwriting the output each time; I hope this helps.

#Add your list/data frame here (test IDs added; for a data frame, loop over seq_len(nrow(gene_list)) and index gene_list$gene)
gene_list <- c("124511", "124512", "124513")

#Start with an empty (NULL) object to collect the output
list_collection <- NULL

#loop
for(gene in seq_along(gene_list)){

  #Select genes one by one
  geneid <- gene_list[gene]

  #Your code 
  fetch <- efetch(geneid, db="protein", rettype = 'gp', retmode = "text")
  rec <- gbRecord(fetch)
  seq <- getSequence((ft(rec)))
  m <- as.data.frame(seq)
  setnames(m, "x", "sequence")
  protienName <- names(seq)
  m <- add_column(m, protienName, .after = 0)
  m$molecularweight <- mw(m$sequence)
  m$m<- str_count(m$sequence, 'm')
  m$cc <- str_count(m$sequence, 'cc')
  logvec <- grepl('(Protein)|(Region)', m$protienName)
  m <- subset(m, logvec)

  #Add unique gene to the first column for downstream filtering 
  m <- data.frame(geneid, m) 

  #Collect the rows without overwriting earlier iterations
  list_collection <- rbind(list_collection, m)
}
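Calling rbind() inside the loop copies the growing data frame on every pass, which can become slow for long ID lists. A common alternative, sketched here under assumptions, is to store each iteration's result in its own list slot and bind once after the loop. process_gene() is a hypothetical stand-in for the efetch()/gbRecord() steps above, with made-up data, so that the sketch runs offline.

```r
# Hypothetical stand-in for the efetch()/gbRecord() steps, so this runs offline.
# It returns zero rows for one ID to mimic a record with no matching features.
process_gene <- function(geneid) {
  n <- if (geneid == "124512") 0L else 2L
  data.frame(
    geneid = rep(geneid, n),
    sequence = rep("MKVCC", n),
    stringsAsFactors = FALSE
  )
}

gene_list <- c("124511", "124512", "124513")

# Pre-allocate one slot per ID; each iteration fills its own slot.
results <- vector(mode = "list", length = length(gene_list))

for (i in seq_along(gene_list)) {
  results[[i]] <- process_gene(gene_list[i])
}

# Bind once after the loop instead of growing a data frame on every pass.
list_collection <- do.call(rbind, results)
```

Because each result lives in its own slot, no iteration can overwrite another, and empty results simply contribute no rows.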

Ram · Answer 3 · 2017-03-27

8.3 years ago

zjhzwang ▴ 180

Maybe you can use dplyr::mutate(); it adds new variables (columns) while preserving the existing ones.

updated 2.2 years ago by Ram 45k • written 8.3 years ago by zjhzwang ▴ 180
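To make the distinction concrete, here is a small sketch (toy data, assuming dplyr is installed): mutate() adds columns and leaves the number of rows unchanged, whereas bind_rows() stacks data frames, which is closer to the row-wise "append" the question describes.

```r
library(dplyr)

# Toy data standing in for two per-ID result tables (values are made up).
a <- data.frame(protienName = c("Protein", "Region"),
                sequence = c("MKV", "MCC"),
                stringsAsFactors = FALSE)
b <- data.frame(protienName = "Protein",
                sequence = "MMK",
                stringsAsFactors = FALSE)

# mutate() adds a column; the number of rows stays the same.
a2 <- mutate(a, len = nchar(sequence))

# bind_rows() appends the rows of b below the rows of a.
ab <- bind_rows(a, b)
```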