Question: How to append a data.frame in R with new data
2.7 years ago by
hakimelakhrass80 wrote:

I am trying to run the following code on a list of IDs instead of an individual ID:

source("https://bioconductor.org/biocLite.R")
#install.packages('reutils')
#install.packages('Peptides')
#biocLite(pkgs = c('GenomeInfoDb','GenomicRanges'))
#install.packages('plyr')
#install.packages('devtools')
#devtools::install_github("gschofl/biofiles")
library(Peptides)
library(reutils)
library(Biostrings)
library(biofiles)
library(plyr)
library(stringr)
library(tibble)
#install.packages('data.table')
library(data.table)

#this is exactly the end format of the data frame I want, but for a list of UIDs instead of a single UID like 124511
fetch <- efetch(124511, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
rec <- gbRecord(fetch)
seq <- getSequence((ft(rec)))
m <- as.data.frame(seq)
setnames(m, "x", "sequence")
protienName <- names(seq)
m <- add_column(m, protienName, .after = 0)
m$molecularweight <- mw(m$sequence)
m$m<- str_count(m$sequence, 'm')
m$cc <- str_count(m$sequence, 'cc')
logvec <- grepl('(Protein)|(Region)', m$protienName)
m <- subset(m, logvec)

The problem is that efetch() can only use one ID at a time, so I must either write a for loop or use an apply function over the list of protein IDs. If I took the code as is and ran it over a list, each iteration would overwrite the result of the previous one. I was hoping someone could help me append to the data.frame on each iteration, or show me a way to keep each iteration from replacing the previous one.

protiens genebank R ncbi • 1.4k views

efetch() can take a list of IDs; see https://www.rdocumentation.org/packages/reutils/versions/0.2.2/topics/efetch

written 2.7 years ago by Santosh Anand5.0k

Wow, how about that. For some reason I really thought it couldn't! Thanks, I will try just feeding it a list then.

written 2.7 years ago by hakimelakhrass80
2.7 years ago by
Alex Reynolds29k wrote:

You could have a function called fetch_id() that returns a record for a given id:

fetch_id <- function(id) { return(efetch(id, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)); }

Then, given a vector of IDs:

ids <- c(124511, 124512, 124513, ... )

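you can pre-allocate a list (strictly needed only when filling it by index in a for loop, since lapply() returns a new list of its own):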

l <- vector(mode = "list", length = length(ids))

Then iterate over ids to populate the list l with fetched results:

l <- lapply(ids, fetch_id)

Once you have this list object populated, you can run lapply() on it to run a function of your choice on each element of it:

process_fetched_record <- function(fr) {
    rec <- gbRecord(fr)
    seq <- getSequence((ft(rec)))
    m <- as.data.frame(seq)
    setnames(m, "x", "sequence")
    protienName <- names(seq)
    m <- add_column(m, protienName, .after = 0)
    m$molecularweight <- mw(m$sequence)
    m$m<- str_count(m$sequence, 'm')
    m$cc <- str_count(m$sequence, 'cc')
    logvec <- grepl('(Protein)|(Region)', m$protienName)
    m <- subset(m, logvec)
    return(m)
}

m <- lapply(l, process_fetched_record)

Then you have a list m that you can access by index.
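
If a single data frame is wanted rather than a list, the per-ID results can be stacked in one step. Here is a minimal sketch in base R, assuming every element of the list has the same columns. make_record() is a hypothetical stand-in for process_fetched_record(), and its return values are made up so the example runs without NCBI access:

```r
# Hypothetical stand-in for process_fetched_record(): returns a small
# data frame per ID so the example runs offline.
make_record <- function(id) {
  data.frame(
    protienName = paste0("Protein_", id, c("_a", "_b")),
    sequence = c("MKT", "MCC"),
    stringsAsFactors = FALSE
  )
}

ids <- c("124511", "124512", "124513")

# One data frame per ID, named by ID so the origin of each row is traceable.
m <- lapply(ids, make_record)
names(m) <- ids

# Stack all per-ID data frames into one; requires identical column names.
combined <- do.call(rbind, m)

# Record which ID each row came from, then drop the generated row names.
combined$id <- rep(names(m), times = vapply(m, nrow, integer(1)))
rownames(combined) <- NULL
```

Since data.table is already loaded in the question, rbindlist(m, idcol = "id") should give an equivalent result in one call.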

2.7 years ago by
jared.j.tromp0 wrote:

I'm not familiar with your code, and I made a few adjustments to get it to run, so I hope it still produces what you want.

You asked for a loop that iterates over your code without overwriting the output each time; I hope this helps.

#Add list/data frame here (added test IDs; for a data frame, loop over seq_len(nrow(gene_list)) and index gene_list$gene instead)
gene_list <- c("124511", "124512", "124513")

#create an empty object to collect the output
list_collection <- NULL

#loop
for(gene in seq_along(gene_list)){

  #Select genes one by one
  geneid <- gene_list[gene]

  #Your code 
  fetch <- efetch(geneid, db="protein", rettype = 'gp', retmode = "text")
  rec <- gbRecord(fetch)
  seq <- getSequence((ft(rec)))
  m <- as.data.frame(seq)
  setnames(m, "x", "sequence")
  protienName <- names(seq)
  m <- add_column(m, protienName, .after = 0)
  m$molecularweight <- mw(m$sequence)
  m$m<- str_count(m$sequence, 'm')
  m$cc <- str_count(m$sequence, 'cc')
  logvec <- grepl('(Protein)|(Region)', m$protienName)
  m <- subset(m, logvec)

  #Add unique gene to the first column for downstream filtering 
  m <- data.frame(geneid, m) 

  #Collect information by row without overwriting earlier results
  list_collection <- rbind(list_collection, m)
}
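
Growing list_collection with rbind() inside the loop copies the accumulated data frame on every pass, and data.frame(geneid, m) stops with an error when the filter leaves zero rows. One possible alternative, sketched in base R, stores each result in a pre-allocated list and binds once at the end. process_id() is a hypothetical placeholder for the fetch-and-parse steps, and its return values are made up so the example runs offline:

```r
# Hypothetical placeholder for the efetch()/gbRecord() steps; the returned
# values are made up so the example runs offline.
process_id <- function(geneid) {
  if (geneid == "124512") {
    # Simulate an ID whose rows were all filtered out.
    return(data.frame(protienName = character(0), sequence = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(protienName = paste0("Protein_", geneid), sequence = "MKCC",
             stringsAsFactors = FALSE)
}

gene_list <- c("124511", "124512", "124513")

# Pre-allocate one slot per ID and fill it by index.
results <- vector(mode = "list", length = length(gene_list))
for (i in seq_along(gene_list)) {
  m <- process_id(gene_list[i])
  # rep() gives a zero-length column when m has no rows, so this stays valid.
  results[[i]] <- data.frame(geneid = rep(gene_list[i], nrow(m)), m,
                             stringsAsFactors = FALSE)
}

# Bind once, after the loop.
collection <- do.call(rbind, results)
```

IDs that yield no rows simply contribute nothing to collection, so the loop does not need a special case for them.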

Probably not a good idea to name a variable after a keyword (list).

written 2.7 years ago by Alex Reynolds29k
2.7 years ago by
zjhzwang180 wrote:

Maybe you can use mutate() from dplyr; it adds new variables and preserves existing ones.
