R extract gene names and protein descriptions from concatenated fasta headers
2.9 years ago

I have a column of UniProt-style FASTA headers.

Some rows hold a single FASTA header, and others hold several headers separated by semicolons.

Example (row 1 has a single FASTA header; row 2 has three FASTA headers joined by semicolons):

 df<- data.frame(
      fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"))

I have tried the following regex, but it only returns the last matching instance in each row, and I would like all matched instances:

# "\ " is not a valid escape in an R string, so plain spaces are used in the pattern
df$'protein names' = ifelse(grepl(".*_PIG (.*) OS.*", df$fasta_headers), 
                                gsub(".*_PIG (.*) OS.*", "\\1", df$fasta_headers), 
                                "") 

df$'gene names'= ifelse(grepl(".* GN=([^ ]+).*", df$fasta_headers), 
                               gsub(".* GN=([^ ]+).*", "\\1", df$fasta_headers), 
                               "")

The desired output is:

df_out <- data.frame(
  fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"),
  gene_names = c("VWA1","stat5B; STAT5A"),
  protein_names = c("von Willebrand factor A domain containing 1","Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"))

There could be any number of semicolons, not just 2 or 3.

Any help would be appreciated.

uniprot fasta R • 1.2k views

Why is von Willebrand factor A domain containing 1 excluded? If this is a standalone operation, it can be done outside R.


Thanks, I have updated the question to make the desired outcome clearer. I do not wish to filter, just to parse the FASTA header information. The real data frame has many other data columns attached; it is not a standalone FASTA file.


You already have fasta_headers, so you can build the data frame from the following vectors:

> library(stringr)
> str_match_all(df$fasta_headers, "PIG (.*?) OS") %>% 
+     lapply(., function (x) str_c(x[,2],collapse='; ')) %>% 
+     unlist()
[1] "von Willebrand factor A domain containing 1"                                                          
[2] "Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"
> str_match_all(df$fasta_headers, "GN=(.*?) PE") %>% 
+     lapply(., function (x) str_c(x[,2],collapse='; ')) %>% 
+     unlist()
[1] "VWA1"           "stat5B; STAT5A"
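
For reference, the original attempt returns only one header per row because the leading `.*` is greedy and swallows every earlier header before the capture group. Below is a base-R sketch of the same idea that writes the results back into the data frame. It assumes that semicolons occur only as header separators (never inside a protein description) and that each complete header carries ` OS=` and ` GN=` fields; `extract_field` is a hypothetical helper name, not part of any package.

```r
# Example data copied from the question; the third header in row 2 is truncated.
df <- data.frame(
  fasta_headers = c(
    "tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2",
    "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each row on ";", keep the headers that match
# `pattern`, pull out capture group 1, and join the results with "; ".
extract_field <- function(x, pattern) {
  vapply(strsplit(x, ";", fixed = TRUE), function(headers) {
    hit <- grepl(pattern, headers, perl = TRUE)
    paste(sub(pattern, "\\1", headers[hit], perl = TRUE), collapse = "; ")
  }, character(1))
}

# Protein name: text between the first space and " OS=" (lazy match).
df$protein_names <- extract_field(df$fasta_headers, "^\\S+ (.*?) OS=.*$")
# Gene name: the non-space run following " GN=".
df$gene_names <- extract_field(df$fasta_headers, "^.* GN=(\\S+).*$")
```

Because each header is handled separately after the split, the number of semicolons per row does not matter, and truncated headers without the expected fields are skipped rather than producing partial matches. The protein pattern also does not hard-code `_PIG`, so it should carry over to other species suffixes under the same assumptions.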
