Question: How to append a data.frame in R with new data
24 months ago, hakimelakhrass wrote:

I am trying to run the following code on a list of IDs instead of a single ID:

source("https://bioconductor.org/biocLite.R")
#install.packages('reutils')
#install.packages('Peptides')
#biocLite(pkgs = c('GenomeInfoDb','GenomicRanges'))
#install.packages('plyr')
#install.packages('devtools')
#devtools::install_github("gschofl/biofiles")
library(Peptides)
library(reutils)
library(Biostrings)
library(biofiles)
library(plyr)
library(stringr)
library(tibble)
#install.packages('data.table')
library(data.table)

# This is exactly the final format of the data frame I want, but for a list of UIDs instead of a single UID such as 124511
fetch <- efetch(124511, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
rec <- gbRecord(fetch)
seq <- getSequence((ft(rec)))
m <- as.data.frame(seq)
setnames(m, "x", "sequence")
protienName <- names(seq)
m <- add_column(m, protienName, .after = 0)
m$molecularweight <- mw(m$sequence)
m$m<- str_count(m$sequence, 'm')
m$cc <- str_count(m$sequence, 'cc')
logvec <- grepl('(Protein)|(Region)', m$protienName)
m <- subset(m, logvec)

The problem is that efetch() can only use one ID at a time, so I must either write a for loop or use an apply function over the list of protein IDs. If I wrapped the code as-is in a loop, each iteration would overwrite the result of the previous one. I was hoping someone could help me append to the data.frame on each iteration, or show me a way to keep each iteration from replacing the previous one.


efetch() can take a list of IDs: https://www.rdocumentation.org/packages/reutils/versions/0.2.2/topics/efetch

written 24 months ago by Santosh Anand

Wow, how about that. For some reason I really thought it couldn't! Thanks, I will try just feeding it a list then.

written 24 months ago by hakimelakhrass
24 months ago, Alex Reynolds (Seattle, WA USA) wrote:

You could have a function called fetch_id() that returns a record for a given id:

fetch_id <- function(id) {
    efetch(id, db = db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
}

Then, given a vector of IDs:

ids <- c(124511, 124512, 124513, ... )

you can iterate over ids to populate a list l with fetched results. There is no need to pre-allocate the list, because lapply() creates and returns it:

l <- lapply(ids, fetch_id)

Once you have this list object populated, you can run lapply() on it to run a function of your choice on each element of it:

process_fetched_record <- function(fr) {
    rec <- gbRecord(fr)
    seq <- getSequence((ft(rec)))
    m <- as.data.frame(seq)
    setnames(m, "x", "sequence")
    protienName <- names(seq)
    m <- add_column(m, protienName, .after = 0)
    m$molecularweight <- mw(m$sequence)
    m$m<- str_count(m$sequence, 'm')
    m$cc <- str_count(m$sequence, 'cc')
    logvec <- grepl('(Protein)|(Region)', m$protienName)
    m <- subset(m, logvec)
    return(m)
}

m <- lapply(l, process_fetched_record)

Then you have a list m of data frames, one per ID, that you can access by index (for example, m[[1]]).
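
If a single data frame is needed rather than a list, the per-ID data frames can be stacked row-wise. What follows is only a minimal sketch, assuming every element of the list has the same columns. make_record() is a hypothetical stand-in for process_fetched_record() so that the example runs without contacting NCBI, and the uid column is an addition for illustration.

```r
# Hypothetical stand-in for process_fetched_record(): returns a small
# data frame with the same columns for every ID (no NCBI access needed).
make_record <- function(id) {
  data.frame(
    protienName = paste0("Protein_", id, c("_a", "_b")),
    sequence = c("MKCCV", "MMAC"),
    stringsAsFactors = FALSE
  )
}

ids <- c("124511", "124512", "124513")

# One data frame per ID, named by ID so each result can be traced back.
per_id <- lapply(ids, make_record)
names(per_id) <- ids

# Tag each data frame with its ID, then stack all of them row-wise.
tagged <- Map(
  function(df, id) data.frame(uid = id, df, stringsAsFactors = FALSE),
  per_id, names(per_id)
)
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

Keeping the ID as a column means the combined data frame can still be filtered or split by record afterwards.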

24 months ago, jared.j.tromp wrote:

I'm not familiar with your code, and I made a few adjustments to get it to run, so I hope it still produces what you want.

You asked for a loop that iterates over your code without overwriting the output each time; I hope this helps.

# Add your list/data frame of IDs here (test IDs added; for a data frame, loop over seq_len(nrow(gene_list)) instead)
gene_list <- c("124511", "124512", "124513")

# Create an empty object to collect the output
list_collection <- NULL

# Loop over the IDs
for (gene in seq_along(gene_list)) {

  #Select genes one by one
  geneid <- gene_list[gene]

  #Your code 
  fetch <- efetch(geneid, db="protein", rettype = 'gp', retmode = "text")
  rec <- gbRecord(fetch)
  seq <- getSequence((ft(rec)))
  m <- as.data.frame(seq)
  setnames(m, "x", "sequence")
  protienName <- names(seq)
  m <- add_column(m, protienName, .after = 0)
  m$molecularweight <- mw(m$sequence)
  m$m<- str_count(m$sequence, 'm')
  m$cc <- str_count(m$sequence, 'cc')
  logvec <- grepl('(Protein)|(Region)', m$protienName)
  m <- subset(m, logvec)

  #Add unique gene to the first column for downstream filtering 
  m <- data.frame(geneid, m) 

  # Collect the results row by row without overwriting earlier ones
  list_collection <- rbind(list_collection, m)
}
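
Calling rbind() inside the loop copies the accumulated data frame on every iteration, which can become slow for long ID lists. One common alternative is to store each result in its own slot of a pre-allocated list and bind everything once at the end. The following is a sketch of that pattern, not a drop-in replacement: process_one() is a hypothetical stand-in for the fetch-and-process steps above, so the example runs without network access.

```r
# Hypothetical stand-in for the fetch-and-process steps inside the loop.
process_one <- function(geneid) {
  data.frame(
    geneid = geneid,
    sequence = c("MKCCV", "MMAC"),
    stringsAsFactors = FALSE
  )
}

gene_list <- c("124511", "124512", "124513")

# Pre-allocate one slot per ID; each iteration fills only its own slot,
# so earlier results are never overwritten.
results <- vector(mode = "list", length = length(gene_list))
for (i in seq_along(gene_list)) {
  results[[i]] <- process_one(gene_list[i])
}

# Bind all per-ID data frames together in a single step.
collection <- do.call(rbind, results)
```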

Probably not a good idea to name a variable after a built-in function (list).

written 24 months ago by Alex Reynolds
24 months ago, zjhzwang wrote:

Maybe you can use mutate() from dplyr; it adds new variables and preserves the existing ones.
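
For context, adding variables works column-wise, whereas collecting results for several IDs means appending rows. The sketch below contrasts the two in base R, with made-up sequences; transform() is used here only as a rough base-R analogue of dplyr's mutate(), and the len column is an assumption for illustration.

```r
m <- data.frame(sequence = c("MKCCV", "MMAC"), stringsAsFactors = FALSE)

# Column-wise: add a derived column while keeping the existing ones
# (base R transform() used as a rough analogue of dplyr::mutate()).
m <- transform(m, len = nchar(sequence))

# Row-wise: append the rows of a second data frame with the same columns.
extra <- data.frame(sequence = "ACCM", len = 4L, stringsAsFactors = FALSE)
m_all <- rbind(m, extra)
```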
