How to move the last 4 characters of all FASTA headers to the beginning?
6.8 years ago
fibar ▴ 90

The headers in my FASTA file look like this:

>M02529:151:000000000-AJBNG:1:1101:20806:3573:133
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>M02529:151:000000000-AJBNG:1:1101:8182:3623:133
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG


The "133" is the sample name, and I need it at the beginning of the header followed by a dot, like this:

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>:133.M02529:151:000000000-AJBNG:1:1101:8182:3623
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG


I would be glad to get a 'sed' or 'awk' command to do it and modify my FASTA file. Thanks a lot!



Are you sure it's always 4 characters and not "the characters after the last colon, plus the colon itself"?

In this case, the last 4 characters work (the sample names range from 133 to 166). However, I will also keep your suggestion as an option, thanks!

6.8 years ago
$ echo -e ">M02529:151:000000000-AJBNG:1:1101:20806:3573:133\nA" | awk -F '[>:]' '/^>/{printf(">:%s.%s:%s:%s:%s:%s:%s:%s\n",$9,$2,$3,$4,$5,$6,$7,$8);next;} {print;}'

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
A

Thanks a lot Pierre. I'm sorry I was not totally clear. Could you please adapt your command to modify an entire FASTA file?
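
For a whole file, one way to do it (a sketch, not part of the original reply) is to pass the file to the same awk program and redirect the result to a new file. The temporary files below stand in for your real input and output paths and are only there to make the example self-contained. It assumes every header has exactly eight colon-separated fields, as in the question:

```shell
# Hypothetical input/output files; replace with your own paths
input_fasta=$(mktemp)
output_fasta=$(mktemp)

# Small sample with the header layout from the question
printf '%s\n' \
  '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133' \
  'TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC' > "$input_fasta"

# Header lines are rebuilt with field 9 (the sample name) first;
# all other lines (the sequences) are printed unchanged
awk -F '[>:]' '/^>/{printf(">:%s.%s:%s:%s:%s:%s:%s:%s\n",$9,$2,$3,$4,$5,$6,$7,$8);next;} {print;}' \
  "$input_fasta" > "$output_fasta"

cat "$output_fasta"
```

Writing to a separate output file keeps the original intact until you have checked the result.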

6.8 years ago
Daniel ★ 3.9k

Here's the code:

sed -i 's/>\(.*\)\(....\)/>\2\1/' myfile.fasta


Here's the why, so you can learn for next time:

sed -i        ## The -i flag means "edit the file in place rather than outputting to a new one"

's/abc/def/'  ## s/  means to substitute what is within the first set of forward slashes / / with the second (all enclosed in single quotes)

\(12345\)     ## Anything matched between \( and \) can be referred to by \NUMBER in the replacement section

>\(.*\)       ## Here we say "starts with a greater-than symbol '>'"; then '.*' means "any character, any number of times". So we collect everything after the '>' into number 1 (for later) by putting it between \( and \).

\(....\)      ## EXCEPT for the last four characters! A period '.' means any one character, so four of them put the last 4 characters into number 2 for later.

>\2\1         ## Now we say what to replace the match with: the '>' symbol again, then the second thing we matched '\2', then the first thing '\1'.


myfile.fasta ## Lastly, put the file you're working on.
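
To check the substitution before touching the real file, it can be run without -i on a small sample, so sed only prints the result. This is a sketch; the temporary sample file is made up for illustration. The second command adds a '.' between the two groups, which should reproduce the exact format asked for in the question (colon kept, dot after the sample name):

```shell
# Hypothetical sample file holding one record from the question
sample=$(mktemp)
printf '%s\n' \
  '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133' \
  'TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC' > "$sample"

# Without -i, sed writes to standard output and leaves the file unchanged
preview=$(sed 's/>\(.*\)\(....\)/>\2\1/' "$sample")

# Same pattern, with a literal '.' inserted after the moved characters
with_dot=$(sed 's/>\(.*\)\(....\)/>\2.\1/' "$sample")

printf '%s\n' "$preview" "$with_dot"
```

Once the preview looks right, the same expression can be used with -i. Note that BSD/macOS sed expects a backup-suffix argument after -i (for example -i ''), whereas GNU sed accepts -i on its own.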

Hope this helps you or others. #LazyFridayAfternoon.

Edit: If you wanted to lose the leading colon and add a period after the sample name, this would work:

sed -i 's/>\(.*\):\(...\)/>\2.\1/' myfile.fasta


That says the ':' is outside of what you want to keep, then you keep only 3 characters, and then add a '.' after the match in the second half of the expression.
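
If the sample name is not always three characters, the idea raised in the comments ("everything after the last colon") can be sketched as below. This is an assumption-labelled variant, not part of the original answer: '[^:]*' matches a run of non-colon characters, and the '^' and '$' anchors tie the match to the whole header line. The second header, with a four-character sample name, is invented purely to show the difference:

```shell
# Hypothetical sample file: one real header from the question and one
# invented header with a longer sample name (1330)
sample=$(mktemp)
printf '%s\n' \
  '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133' \
  'TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC' \
  '>M02529:151:000000000-AJBNG:1:1101:8182:3623:1330' \
  'TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG' > "$sample"

# \1 = everything up to the last colon, \2 = everything after it
moved=$(sed 's/^>\(.*\):\([^:]*\)$/>\2.\1/' "$sample")
printf '%s\n' "$moved"
```

Because '.*' is greedy, it runs up to the last colon on the line, so only the final field is moved.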



Thanks Daniel!