Append fasta header to corresponding fasta filename
3.7 years ago

Hey everyone. When you download an assembly from NCBI RefSeq, the filename will be, for example, GCF_006351845.1_ASM635184v1_genomic.fna, and the corresponding fasta header is:

>NZ_CP040904.1 Enterococcus faecium strain N56454 chromosome, complete genome

After some formatting, all my fasta headers are like this, for example:

>NZ_CP040904.1_Ef

I would like to rename each file like this: Ef_GCF_006351845.1_ASM635184v1_genomic.fna. That is, copy the text after the last underscore in the fasta header and put it at the beginning of the filename.

Could you guys help me out?

Thanks!

sequence • 889 views

Here's some logic to approach the problem:

For each of these files, take the first line, cut out the last part (parts are separated by _), and store it in a variable. Then rename the file so that this variable precedes the original file name. This can be done in a loop containing two commands. bash is enough for this; you won't need a programming language.
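
As a minimal sketch of the first step, assuming headers shaped like the example in the question, the prefix can be pulled out of a header string with plain shell parameter expansion. The function name header_suffix is hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print the text after the last "_" in a FASTA header.
header_suffix() {
  # "##*_" removes the longest leading match of "*_", leaving the last field
  printf '%s\n' "${1##*_}"
}

header_suffix ">NZ_CP040904.1_Ef"   # prints: Ef
```

The same expansion works on a variable holding the first line of a file, so no external tool is strictly needed for the cutting step.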


I'm no expert in this, but I wrote this

for F in *.fna ; do N=$(awk -F '>|_' '/^>/ {print $4}' $F) ; echo mv -v $F $N_$F ; done

I understand it should be something similar to this, but I'm making some mistakes. Could you help me out?


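The loop below copies each file to its new name so you can check the result first. Change cp to mv to rename instead of copy.

for F in *.fna; do
  # Third "_"-separated field of the first line, e.g. "Ef" from ">NZ_CP040904.1_Ef"
  # (this assumes the accession itself contains exactly one underscore)
  N=$(head -n1 "$F" | cut -d"_" -f3)_$F
  cp "$F" "$N"
done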
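
One likely reason the one-liner earlier in the thread misbehaves: in $N_$F the shell reads N_ as the variable name, because underscores are valid in names, so the prefix expands to nothing. Writing ${N}_$F avoids that. Below is a hedged sketch that also takes the last "_" field rather than a fixed field number. It assumes the first line of each file is the header and that the files sit in the current directory; prefix_with_header_tag and DRY_RUN are hypothetical names. By default it only prints the mv commands.

```shell
# Hypothetical helper: rename FILE to <tag>_FILE, where <tag> is the text
# after the last "_" on the first line of FILE. FILE must be a bare
# filename in the current directory (no directory component).
prefix_with_header_tag() {
  f=$1
  header=$(head -n 1 "$f")
  tag=${header##*_}
  # Skip files whose first line is not a ">" header or has no usable "_" field
  if [ "${header#\>}" = "$header" ] || [ "$tag" = "$header" ] || [ -z "$tag" ]; then
    echo "skipping $f: no usable header" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = 1 ]; then
    # Dry run: show the command instead of running it
    echo mv -v -- "$f" "${tag}_$f"
  else
    mv -- "$f" "${tag}_$f"
  fi
}

for f in *.fna; do
  [ -e "$f" ] || continue              # the glob matched nothing
  prefix_with_header_tag "$f" || continue
done
```

Once the printed commands look right, running the loop again with DRY_RUN=0 set in the environment performs the actual renames.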