Question

R extract gene names and protein descriptions from concatenated fasta headers

2.9 years ago

moritz.lasse • 0

I have a data frame column of FASTA headers in UniProt style.

Some rows hold a single FASTA header, and some hold multiple FASTA headers separated by semicolons.

Example (row 1 is a single FASTA header; row 2 is three FASTA headers concatenated with semicolons):

df <- data.frame(
  fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"))

I have tried the following regular expressions, but they only return the last match in each row, whereas I would like every match:

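# Protein description: the text between "_PIG " and " OS".
# The greedy leading ".*" means gsub() keeps only one header per row.
df$'protein names' <- ifelse(grepl(".*_PIG (.*) OS.*", df$fasta_headers),
                             gsub(".*_PIG (.*) OS.*", "\\1", df$fasta_headers),
                             "")

# Gene name: the non-space token following " GN=".
df$'gene names' <- ifelse(grepl(".* GN=([^ ]+).*", df$fasta_headers),
                          gsub(".* GN=([^ ]+).*", "\\1", df$fasta_headers),
                          "")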

The desired output is:

df_out <- data.frame(
  fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"),
  gene_names = c("VWA1","stat5B; STAT5A"),
  protein_names = c("von Willebrand factor A domain containing 1","Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"))

There could be any number of semicolons, not just 2 or 3.

Any help would be appreciated.

uniprot fasta R • 1.2k views

ADD COMMENT • link updated 2.9 years ago by cpad0112 21k • written 2.9 years ago by moritz.lasse • 0

Why is "von Willebrand factor A domain containing 1" excluded? If this is a standalone operation, it can be done outside R.

ADD REPLY • link 2.9 years ago by cpad0112 21k

Thanks, I have updated the question to make the desired outcome clearer. I do not wish to filter anything, just parse the FASTA header information. The real data frame has many other data columns attached, so this is not a standalone FASTA file.

ADD REPLY • link 2.9 years ago by moritz.lasse • 0

You already have fasta_headers, so you can build the data frame from the following vectors:

> library(stringr)
> str_match_all(df$fasta_headers, "PIG (.*?) OS") %>% 
+     lapply(., function (x) str_c(x[,2],collapse='; ')) %>% 
+     unlist()
[1] "von Willebrand factor A domain containing 1"                                                          
[2] "Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"
> str_match_all(df$fasta_headers, "GN=(.*?) PE") %>% 
+     lapply(., function (x) str_c(x[,2],collapse='; ')) %>% 
+     unlist()
[1] "VWA1"           "stat5B; STAT5A"
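
If the headers are not all from `_PIG` entries, or you prefer base R, a minimal sketch along the same lines is to split each cell on semicolons and parse every header separately. The helper name `parse_headers` and both patterns are assumptions made for illustration, not part of the original answer. The sketch also assumes that protein descriptions never contain a semicolon and that every complete header has an `OS=` field; truncated headers that lack the field are skipped.

```r
# Example data copied from the question.
df <- data.frame(
  fasta_headers = c(
    "tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2",
    "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"),
  stringsAsFactors = FALSE)

# Hypothetical helper: split one cell into individual headers, take the
# first capture group of `pattern` from each, and join the hits with "; ".
parse_headers <- function(cell, pattern) {
  headers <- strsplit(cell, ";", fixed = TRUE)[[1]]
  matches <- regmatches(headers, regexec(pattern, headers, perl = TRUE))
  # Headers without a match (e.g. truncated ones) yield NA and are dropped.
  hits <- vapply(matches,
                 function(m) if (length(m) >= 2) m[2] else NA_character_,
                 character(1))
  paste(hits[!is.na(hits)], collapse = "; ")
}

# Description: everything between the first token and " OS=" (assumed layout).
df$protein_names <- vapply(df$fasta_headers, parse_headers, character(1),
                           pattern = "^\\S+\\s+(.*?)\\s+OS=", USE.NAMES = FALSE)
# Gene name: the non-space token after " GN=".
df$gene_names <- vapply(df$fasta_headers, parse_headers, character(1),
                        pattern = " GN=(\\S+)", USE.NAMES = FALSE)
```

Because each header is matched on its own, the number of semicolons per row does not matter, and the other columns of the data frame are left untouched.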

ADD REPLY • link 2.9 years ago by cpad0112 21k