How to move the last 4 characters of all FASTA headers to the beginning?
8.9 years ago
fibar ▴ 90

The headers in my FASTA file look like this:

>M02529:151:000000000-AJBNG:1:1101:20806:3573:133
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>M02529:151:000000000-AJBNG:1:1101:8182:3623:133
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG

The "133" is the sample name, and I need it at the beginning of the header followed by a dot, like this:

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>:133.M02529:151:000000000-AJBNG:1:1101:8182:3623
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG

I would be glad to get a 'sed' or 'awk' command to do it and modify my FASTA file. Thanks a lot!

Cheers!

header fasta • 3.2k views

Are you sure it's always 4 characters and not "the characters after the last colon, plus the colon itself"?


In this case, the last 4 characters work (the sample names range between 133 and 166). However, I'll also keep your suggestion as an option, thanks!

8.9 years ago
$ echo -e ">M02529:151:000000000-AJBNG:1:1101:20806:3573:133\nA" | awk -F '[>:]' '/^>/{printf(">:%s.%s:%s:%s:%s:%s:%s:%s\n",$9,$2,$3,$4,$5,$6,$7,$8);next;} {print;}'

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
A

Thanks a lot Pierre. I'm sorry I was not totally clear. Could you please adapt your command to modify an entire FASTA file?
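
One way to run the same awk program on a whole file, as a minimal sketch: pass the file name to awk instead of piping in `echo` output, and redirect the result to a new file. The function name `reheader_awk` and the file names in the usage comment are hypothetical, and the field numbers assume every header has exactly eight colon-separated fields, as in the example headers.

```shell
# Sketch only: assumes each header has exactly 8 colon-separated fields,
# the last one being the sample name (e.g. 133).
# reheader_awk is a hypothetical helper name, not part of any tool.
reheader_awk() {
    awk -F '[>:]' '
        # Header lines: $1 is the empty text before ">", $2..$8 are the read ID
        # fields, $9 is the sample name. Print the sample name first.
        /^>/ { printf(">:%s.%s:%s:%s:%s:%s:%s:%s\n", $9, $2, $3, $4, $5, $6, $7, $8); next }
        # Sequence lines are printed unchanged.
        { print }
    ' "$1"
}

# Usage (hypothetical file names):
#   reheader_awk myfile.fasta > myfile.renamed.fasta
```

Writing to a new file rather than overwriting the input keeps the original available in case a header does not have the expected number of fields.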

8.9 years ago
Daniel ★ 4.0k

Here's the code:

sed -i 's/>\(.*\)\(....\)/>\2\1/' myfile.fasta

Here's the why, so you can learn for next time:

sed -i        ## The -i flag edits the file in place instead of printing the result to standard output. (This is the GNU sed form; BSD/macOS sed expects a backup suffix after -i, e.g. -i ''.)

's/abc/def/'  ## s/ means substitute: whatever matches the pattern between the first pair of forward slashes is replaced by the text between the second pair (all enclosed in single quotes).

\(12345\)     ## Anything matched between escaped parentheses \( and \) is captured and can be referred to as \NUMBER in the replacement.

>\(.*\)       ## Here we match a greater-than symbol '>', then '.*' means "any character, any number of times", so everything after the '>' is collected into group 1 (for later) by putting it between \( and \).

\(....\)      ## EXCEPT for the last four characters. A period '.' matches any one character, so four of them capture four characters into group 2. Because '.*' is greedy, it takes as much as it can and leaves only the final four characters of the line for this group.

>\2\1         ## This is the replacement: the '>' symbol again, then the second group '\2', then the first group '\1'.

myfile.fasta  ## Lastly, the file you're working on.
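
Before editing a file in place, it can help to try the substitution on one record without -i, so the result goes to the terminal and nothing is modified. The sketch below wraps the same pattern in a hypothetical function `move_last4`; the `^` and `$` anchors are my addition (not in the command above) to tie the match to the whole header line.

```shell
# Dry run of the substitution on standard input; no file is modified.
# move_last4 is a hypothetical helper name. The ^ and $ anchors are an
# addition to the original pattern so it only matches complete header lines.
move_last4() {
    sed 's/^>\(.*\)\(....\)$/>\2\1/'
}

printf '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133\nTGGGGAATTG\n' | move_last4
# prints:
# >:133M02529:151:000000000-AJBNG:1:1101:20806:3573
# TGGGGAATTG
```

Sequence lines contain no '>' and pass through unchanged. Note that "the last four characters" includes any trailing carriage return, so a file with Windows line endings would need converting first.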

Hope this helps you or others. #LazyFridayAfternoon.

Edit: If you wanted to drop the leading colon and add a period after the sample name, this would work:

sed -i 's/>\(.*\):\(...\)/>\2.\1/' myfile.fasta

Here the ':' sits outside both capture groups, so it is not kept; the second group keeps only 3 characters, and the '.' in the replacement is added right after them.
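
If the sample name is not always three characters, a variant along the lines of the comment under the question ("the characters after the last colon") may be more robust. This is a sketch under assumptions: the function name `move_sample_to_front` and the output file name are hypothetical, and it assumes the header ends with the sample name, with no description text after it.

```shell
# Sketch: move whatever follows the LAST colon to the front, followed by a dot.
# move_sample_to_front is a hypothetical helper name.
# [^:]* matches a run of non-colon characters, and $ anchors it to the end of
# the line, so group 2 is the text after the final colon regardless of length.
move_sample_to_front() {
    sed 's/^>\(.*\):\([^:]*\)$/>\2.\1/'
}

# Usage (hypothetical output name), writing a new file instead of using -i:
#   move_sample_to_front < myfile.fasta > myfile.renamed.fasta
```

Reading from standard input and writing to a new file avoids the difference in -i syntax between GNU and BSD sed and leaves the original file untouched.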

#SuperLazyFridayAfternoon


Thanks Daniel!
