Question

How can I add the information at the end of the line to the beginning of the line in R?

2

2.2 years ago

logbio ▴ 30

I have a FASTA file. I want to add the information in square brackets at the end of each header line to the beginning of that line, without the brackets.

From:

gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]

To:

Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]

String Programming Regex R gsub • 1.1k views

ADD COMMENT • link updated 2.2 years ago by cpad0112 21k • written 2.2 years ago by logbio ▴ 30

0

If the sequences are on a single line each and the headers are in exactly the same format:

$ awk -F "[][>]" '/^>/{getline seq}{print ">"$3"_"$2,"["$3"]""\n"seq}' test.fa

>Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2  [Homo sapiens]

with sed:

$ sed -r '/^>/ s/^>(.*)\s\[(.*)\]$/>\2_\1 \[\2\]/g' test.fa

>Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]

ADD REPLY • link 2.2 years ago by cpad0112 21k

score 2 · Answer 1 · 2022-02-02

2

2.2 years ago

fracarb8 ★ 1.6k

I am sure there is a better way, but this works:

string <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]" 
match <- sub("^.*\\[(.*)\\]$","\\1",string)
string <- sub("^",paste0(match,"_"), string)

# Update: Dunois's regex also works without needing stringr
string <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]" 
string <- sub("(^.*)\\[([A-Z]{1}[a-z]+\\s[a-z]+)\\]","\\2_\\1\\[\\2\\]",string)

> string
[1] "Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"
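Note that the output above keeps the space in "Homo sapiens", while the question asks for a "Homo_sapiens_" prefix. A minimal sketch of one way to adapt the two-step approach, assuming spaces are the only characters in the species name that need replacing (the variable names `species`, `prefix` and `result` are just for illustration):

```r
string <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

# Step 1: pull out whatever sits inside the trailing [ ].
species <- sub("^.*\\[(.*)\\]$", "\\1", string)

# Step 2: swap spaces for underscores and add the trailing underscore.
prefix <- paste0(gsub(" ", "_", species, fixed = TRUE), "_")

# Step 3: glue the prefix onto the untouched original line.
result <- paste0(prefix, string)
```

Because the species name is extracted first and only then modified, this should also cope with names of more than two words, as long as they sit in the final pair of square brackets.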

ADD COMMENT • link 2.2 years ago by fracarb8 ★ 1.6k

1

stringr's just for syntactic convenience.

And also, I forgot R can handle nested capture groups, so you can actually replace the regex from my solution with the much shorter sub("(.*([A-Z]+[a-z]+) ([a-z]+))", "\\2_\\3_\\1", string). Note I've also fixed the regex to account for the fact that OP wants an underscore within the species name.

There's probably an even more concise solution with a single capture group, but I can't really think of it now. (Not that this matters for the OP probably.)
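To make the group numbering concrete, here is a small sketch that prints nothing but stores what each group captures. Groups are numbered by the position of their opening parenthesis, so the outer group is 1 and the two inner ones are 2 and 3. Using `perl = TRUE` is my own choice here so that the character ranges are interpreted by PCRE; the calls above use R's default engine.

```r
header <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"
pattern <- "(.*([A-Z]+[a-z]+) ([a-z]+))"

# regexec() reports the full match followed by one entry per capture group.
m <- regmatches(header, regexec(pattern, header, perl = TRUE))[[1]]

full_match <- m[1]  # everything up to and including "sapiens"
group_1    <- m[2]  # outer group: identical to the full match here
group_2    <- m[3]  # first inner group: the genus
group_3    <- m[4]  # second inner group: the species epithet
```

The closing "]" is not part of the match, which is why it survives untouched at the end of the line after the substitution.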

ADD REPLY • link 2.2 years ago by Dunois ★ 2.5k

1

Thanks for the clarification. I personally prefer to avoid being too concise when regular expressions are involved.

ADD REPLY • link 2.2 years ago by fracarb8 ★ 1.6k

score 2 · Answer 2 · 2022-02-02

2

2.2 years ago

Dunois ★ 2.5k

Here you go:

library(stringr)

#Toy case.
df <- data.frame(x = "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]", stringsAsFactors = FALSE)

#Un-edited.
df$x

#[1] "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

#Using str_replace with nested capture groups to rearrange the text.
df$x <- str_replace(df$x, "(.*([A-Z]+[a-z]+) ([a-z]+))", "\\2_\\3_\\1")

#Result.
df$x

# [1] "Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

ADD COMMENT • link 2.2 years ago by Dunois ★ 2.5k

1

Thank you for the answer. How can I apply this method to the whole file?

ADD REPLY • link 2.2 years ago by logbio ▴ 30

1

I assumed you had your file imported into R already, as a data.frame or something to that effect.

So if it's just a FASTA file you need to manipulate in general, and you're not bound to R, here's a solution assuming you're working in a Unix-like environment (e.g., Ubuntu, off of which I am basing the rest of the explanation here).

I'm assuming you have all your sequences in a file named input.fasta, which looks something like this:

>gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
>gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]
ATGCATGCAGAGAGAGAG

To update your FASTA headers the way you've indicated in the OP, go to the command line, and execute this:

sed -r 's/^>(.*([A-Z]+[a-z]+) ([a-z]+))/>\2_\3_\1/g' input.fasta > output.fasta

input.fasta is the input to the command line utility sed, and your output will be stored in a file called output.fasta, which will look like this:

>Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
>Escherichia_coli_gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]
ATGCATGCAGAGAGAGAG

I assume this is what you want?

Note: output.fasta will be written to the directory you run sed from. To check where you are, type pwd at the command line, and it will print your current location as a path. If you're inexperienced with this, the easiest approach is to use your GUI file browser to navigate to the directory/folder where input.fasta is located and launch the terminal from there (right click -> "Open in Terminal" in Ubuntu, for example). This way, output.fasta will end up exactly where input.fasta is.
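If you would rather stay in R, the same substitution can be applied to a whole file with base R alone. This is a minimal sketch, not a tested pipeline: `add_species_prefix` is a hypothetical helper name, the temporary files stand in for your real input.fasta and output.fasta, and `perl = TRUE` is my own choice so that the character ranges are interpreted by PCRE. It assumes every header starts with ">" and ends with a two-word species name in square brackets.

```r
# Hypothetical helper: rewrite only the header lines of a FASTA file.
add_species_prefix <- function(lines) {
  is_header <- grepl("^>", lines)
  # Same nested-capture-group regex as in the sed command above.
  lines[is_header] <- sub(
    "^>(.*([A-Z]+[a-z]+) ([a-z]+))",
    ">\\2_\\3_\\1",
    lines[is_header],
    perl = TRUE
  )
  lines
}

# Toy input written to a temporary file so the example is self-contained;
# replace in_file and out_file with your own paths.
in_file  <- tempfile(fileext = ".fasta")
out_file <- tempfile(fileext = ".fasta")
writeLines(c(
  ">gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]",
  "ATGCATGCGTGTGTGTGG",
  ">gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]",
  "ATGCATGCAGAGAGAGAG"
), in_file)

fasta <- readLines(in_file)                       # one element per line
writeLines(add_species_prefix(fasta), out_file)   # sequence lines pass through unchanged
result <- readLines(out_file)
```

readLines() loads the whole file into memory, which is fine for small to moderate FASTA files; for very large ones the sed approach is likely the more practical option.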

ADD REPLY • link 2.2 years ago by Dunois ★ 2.5k