2.7 years ago
moritz.lasse
I have a column of FASTA headers in UniProt style.
Some rows hold a single FASTA header and some hold multiple headers separated by semicolons.
Example (row 1 is a single header; row 2 is three headers concatenated with semicolons):
df <- data.frame(
fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"))
I tried the regex below, but it only returns the last instance in each row; I would like all matched instances.
df$'protein names' <- ifelse(grepl(".*_PIG (.*) OS.*", df$fasta_headers),
gsub(".*_PIG (.*) OS.*", "\\1", df$fasta_headers),
"")
df$'gene names' <- ifelse(grepl(".* GN=([^ ]+).*", df$fasta_headers),
gsub(".* GN=([^ ]+).*", "\\1", df$fasta_headers),
"")
The desired output is:
df_out <- data.frame(
fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"),
gene_names = c("VWA1","stat5B; STAT5A"),
protein_names = c("von Willebrand factor A domain containing 1","Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"))
There could be any number of semicolons, not just 2 or 3.
Any help would be appreciated.
Why is "von Willebrand factor A domain containing 1" excluded? If this is a standalone operation, it can be done outside R.
Thanks, I have updated the question to make it clearer what the desired outcome should be. I do not wish to filter, just parse the FASTA header information. There are many other data columns attached to the real data frame; it is not a standalone FASTA file.
You already have fasta_headers; you can build the data frame from the following vectors:
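One possible way to get such vectors, as a base R sketch rather than the answerer's exact code: `gregexpr()` with `regmatches()` returns every match in a row, not just the last one, and the matches can then be pasted back together with "; ". The helper name `extract_all` is hypothetical, and the patterns assume every header uses the `_PIG` suffix followed by a space, as in the question's example.

```r
df <- data.frame(
  fasta_headers = c(
    "tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2",
    "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect all matches of `pattern` in each element of `x`
# and join them with "; ". Rows without a match yield an empty string.
extract_all <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = "; ")
}

# Gene name: the run of characters after "GN=" up to the next space or semicolon.
df$gene_names <- extract_all(df$fasta_headers, "(?<=GN=)[^ ;]+")

# Protein name: shortest text between "_PIG " and " OS=" (assumes pig entries only).
df$protein_names <- extract_all(df$fasta_headers, "(?<=_PIG ).*?(?= OS=)")
```

The lazy quantifier `.*?` is what keeps each protein name from running on into the next header. A truncated header with no ` OS=` field, like the third one in row 2, contributes nothing.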