Question: How to move the last 4 characters of all FASTA headers to the beginning?
fibar (Argentina) wrote 4.7 years ago:

The headers in my FASTA file look like this:

>M02529:151:000000000-AJBNG:1:1101:20806:3573:133
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>M02529:151:000000000-AJBNG:1:1101:8182:3623:133
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG

The "133" is the sample name, and I need it at the beginning of the header followed by a dot, like this:

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC
>:133.M02529:151:000000000-AJBNG:1:1101:8182:3623
TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG

I would be glad to get a 'sed' or 'awk' command to do it and modify my FASTA file. Thanks a lot!

Cheers!

modified 4.7 years ago by Daniel • written 4.7 years ago by fibar

Are you sure it's always 4 characters and not "the characters after the last colon, plus the colon itself"?

modified 8 months ago by RamRS • written 4.7 years ago by John

In this case, the last 4 characters work (the sample names range between 133 and 166). However, I will also keep your suggestion as an option, thanks!

written 4.7 years ago by fibar
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 4.7 years ago:
$ echo -e ">M02529:151:000000000-AJBNG:1:1101:20806:3573:133\nA" | awk -F '[>:]' '/^>/{printf(">:%s.%s:%s:%s:%s:%s:%s:%s\n",$9,$2,$3,$4,$5,$6,$7,$8);next;} {print;}'

>:133.M02529:151:000000000-AJBNG:1:1101:20806:3573
A
modified 8 months ago by RamRS • written 4.7 years ago by Pierre Lindenbaum

Thanks a lot Pierre. I'm sorry I was not totally clear. Could you please adapt your command to modify an entire FASTA file?

written 4.7 years ago by fibar
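
One way to apply the same idea to a whole file is to give awk the file name and redirect the output to a new file. The sketch below is not Pierre's exact command: it takes the sample name from `$NF` (the last colon-separated field), so it does not rely on every header having exactly eight fields. The file names `input.fasta` and `output.fasta` are placeholders, and the example file is built from the two records in the question.

```shell
# Build a small example file (placeholder name; substitute your own FASTA).
printf '%s\n' \
  '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133' \
  'TGGGGAATTGTTCGCAATGGGCGCAAGCCTGACGACGCAACGCC' \
  '>M02529:151:000000000-AJBNG:1:1101:8182:3623:133' \
  'TCGAGAATAATTCACAATGGGGGCAACCCTGATGGTGCAACGCCG' > input.fasta

# Header lines: split on ':', take the last field as the sample name, and
# rebuild the header as ">:" + sample + "." + the remaining fields joined by ':'.
# All other lines (the sequences) are printed unchanged.
awk -F ':' '
/^>/ {
    sample = $NF
    rest = substr($1, 2)                 # first field without the leading ">"
    for (i = 2; i < NF; i++) rest = rest ":" $i
    print ">:" sample "." rest
    next
}
{ print }
' input.fasta > output.fasta

cat output.fasta
```

Writing to a new file leaves the original untouched, so the result can be checked before the input is replaced.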
Daniel (Cardiff University) wrote 4.7 years ago:

Here's the code:

sed -i 's/>\(.*\)\(....\)/>\2\1/' myfile.fasta

Here's the why, so you can learn for next time:

sed -i        ## The -i flag means "edit the file in place rather than printing the result to standard output"

's/abc/def/'  ## s/ means substitute: whatever matches the pattern between the first pair of forward slashes is replaced by the text between the second pair (all enclosed in single quotes)

\(12345\)     ## Anything matched between \( and \) can be referred to by \NUMBER in the replacement part

>\(.*\)       ## Here we say "starts with a greater-than symbol '>'"; then '.*' means "any character, any number of times", so we collect everything after the '>' into number 1 (for later) by putting it between \( and \).

\(....\)      ## EXCEPT for the last four characters! A period '.' means any one character, so four of them put the last 4 characters into number 2 for later.

>\2\1         ## Now we say what to replace the match with: the '>' symbol again, then the second thing we matched, '\2', then the first thing, '\1'.

myfile.fasta  ## Lastly, the file you're working on.
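
To see what the capture groups pick up before touching a file, the same expression can be tried on a single header without `-i`. This is only a preview sketch: the variable names `header` and `preview` are made up for illustration, and the header is the first one from the question.

```shell
header='>M02529:151:000000000-AJBNG:1:1101:20806:3573:133'

# Without -i, sed only prints the result, so nothing on disk is changed.
# \1 captures everything after ">" except the final four characters,
# \2 captures those four characters (":133").
preview=$(printf '%s\n' "$header" | sed 's/>\(.*\)\(....\)/>\2\1/')

printf '%s\n' "$preview"
# prints: >:133M02529:151:000000000-AJBNG:1:1101:20806:3573
```

The preview makes it clear that this first version moves ":133" to the front but does not add the dot. Note also that `-i` with no argument is the GNU sed form; BSD/macOS sed expects a backup suffix after `-i` (an empty one, `-i ''`, is allowed).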

Hope this helps you or others. #LazyFridayAfternoon.

Edit: If you wanted to drop the leading colon and add a period after the sample name, this would work:

sed -i 's/>\(.*\):\(...\)/>\2.\1/' myfile.fasta

Here the ':' sits outside both capture groups, so it is discarded; the second group keeps only the last 3 characters, and the replacement adds a '.' after them.
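
If the sample name is not always three characters long, a variant along the lines of John's comment is to capture "everything after the last colon" instead of a fixed number of characters. The sketch below is an assumption-laden variation, not the command above: it keeps a leading ':' in the replacement to reproduce the exact format shown in the question (remove that ':' to get the colon-free form), and the file names `sample.fasta` and `renamed.fasta`, the short sequences, and the one-character sample name `7` are invented purely for illustration.

```shell
# Two sample headers whose sample names differ in length (the second is made up).
printf '%s\n' \
  '>M02529:151:000000000-AJBNG:1:1101:20806:3573:133' \
  'ACGT' \
  '>M02529:151:000000000-AJBNG:1:1101:8182:3623:7' \
  'TTGA' > sample.fasta

# \1 = everything between ">" and the last colon; \2 = the text after that colon.
# [^:]* cannot cross a colon, so \2 is the final field whatever its length.
# ^> restricts the substitution to header lines; sequence lines pass through.
sed 's/^>\(.*\):\([^:]*\)$/>:\2.\1/' sample.fasta > renamed.fasta

cat renamed.fasta
```

Redirecting to a second file rather than using `-i` keeps the original intact until the output has been checked.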

#SuperLazyFridayAfternoon

modified 8 months ago by RamRS • written 4.7 years ago by Daniel

Thanks Daniel!

written 4.7 years ago by fibar