How to change sequence description automatically in an MSA
3.2 years ago
jbt38 • 0

I have an MSA (FASTA format) with hundreds of sequences, and the descriptions are in this format:

>gi|AY015275.1|taxonid|154401|organism|Leuenbergeria guamacho|seqid|AY015275.1|description|Pereskia guamacho tRNA-Lys (trnK) gene partial sequence; and maturase K (matK) gene complete cds; chloroplast genes for chloroplast products

How can I change the description of each entry to look like this?

>Leuenbergeria_guamacho

Edited to add an underscore between genus and species.

sequence • 703 views

Assuming that the scientific name is always sandwiched between organism and seqid:

$ seqkit seq test.fa -w 0  -i --id-regexp ".*organism\|(.*)\|seqid.*" | seqkit replace -p " " -r "_"

with sed:

$ sed -r '/^>/ s/(>).*organism\|(.*)\s(.*)\|seqid.*/\1\2_\3/' input.fa
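
Because the new headers keep only the species name, two records from the same species would end up with identical names, which some downstream tools may not accept. A minimal sketch for checking this after renaming, assuming the renamed alignment has been written to a file (the helper name `dup_headers` and the file name `renamed.fa` are hypothetical):

```shell
# Hypothetical helper: list header names that occur more than once in a FASTA file.
dup_headers() {
    # $1 is the path to a FASTA file.
    # Each repeated name is printed once (when seen the second time), without the leading ">".
    awk '/^>/ { if (++seen[$0] == 2) print substr($0, 2) }' "$1"
}
```

Running `dup_headers renamed.fa` prints nothing when every name is unique.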
3.2 years ago
cschu181 ★ 2.8k
awk -F '|' '/^>/ { print ">"$6; next; } { print $0; }' fasta_file

You might also want to replace the space with an underscore:

awk -F '|' '/^>/ { print ">"$6; next; } { print $0; }' fasta_file | tr " " "_"

The following may suffice to print the scientific name with the space replaced by _:

$ awk -F '|' '/^>/ {gsub(/ /,"_",$6);print ">"$6;next}1' fasta_file
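
The awk commands in this thread rely on the scientific name being the sixth `|`-separated field. A minimal sketch that instead looks for the `organism` key and takes the field right after it, assuming every header follows the key|value layout shown in the question (the function name `rename_by_organism` is hypothetical):

```shell
# Hypothetical helper: rewrite FASTA headers to ">Genus_species" by locating
# the "organism" key instead of relying on a fixed field number.
rename_by_organism() {
    awk -F '|' '
        /^>/ {
            name = ""
            # Scan the key|value pairs; a value is the field right after its key.
            for (i = 1; i < NF; i++) {
                if ($i == "organism") { name = $(i + 1); break }
            }
            if (name == "") { print; next }   # no organism key: keep the header unchanged
            gsub(/ /, "_", name)              # Genus species -> Genus_species
            print ">" name
            next
        }
        { print }                             # sequence lines pass through untouched
    ' "$1"
}
```

Usage would be `rename_by_organism fasta_file > renamed.fa`. Headers without an `organism` key are left as they are, so they are easy to spot afterwards.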


Fair enough, I struggle with awk internals.


@cschu181 My awk fundamentals came from Biostars users like Pierre and Kevin, and my struggle is as good as yours, if not worse.
