6.3 years ago
>a
ACTCTAAAT

>b
AAAAACCCT


etc.

To

>a_1
ACTCTAAAT

>b_2
AAAAACCCT

awk '/^>/{$0=$0"_"(++i)}1'  in > out


Could you expand a bit on your post? What's the purpose of doing this?


Sorry, I just wanted to record this. Thanks for your concern; I will explain more next time.

ZQ

6.3 years ago

Another way to do it, which works with single-line FASTA input:

$ awk 'BEGIN{RS=">"}{if(NR>1)print ">"$1"_"(NR-1)"\n"$2}' input.fa > output.fa

A second way, which allows multiline FASTA input:

$ awk 'BEGIN{RS=">";OFS="\n"}(NR>1){print ">"$1"_"(NR-1)"\n";$1="";print $0}' input.fa | awk '$0' > output.fa


Hi - I know this post is quite old now, but I tried the above code for the multi-line fasta and it produced a fasta file that had only the names and then the number, but none of the sequences. I would like to add the numbers to the headers, but keeping the sequences intact in the output file - Is there any way to do this? I have been trying to resolve a makeblastdb error saying I have duplicate seqids and I was hoping this approach might resolve the error.

Edit: I tried the line from Alex Reynolds: awk '/^>/{$0=$0"_"(++i)}1' in > out
And it successfully added a number at the end of the description line, but I am trying to add the number addition directly to the end of the sequence ID number since adding it to the description didn't seem to help my error on makeblastdb.

Eg: I want: " >CP064824.1_1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA...." Instead of: ">CP064824.1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence_1 TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA...."

Thanks!


Given input.fa:

>CP064824.1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC


Note the two Klebsiella pneumoniae entries.
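
To spot repeated IDs like these before renumbering, something along the following lines may help. It is only a sketch: `list_duplicate_ids` is a made-up helper name, and it assumes the sequence ID is the first whitespace-delimited token of each header line.

```shell
# list_duplicate_ids is a hypothetical helper name, not part of any tool.
list_duplicate_ids() {
  # Count the first token of every header line (without the leading ">")
  # and report each ID seen more than once, followed by its count.
  awk '/^>/ { seen[substr($1, 2)]++ }
       END  { for (id in seen) if (seen[id] > 1) print id "\t" seen[id] }' "$@"
}
```

Running `list_duplicate_ids input.fa` on the example above should report CP064824.1 with a count of 2.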

The following will append an incremented counter to the sequence ID:

$ awk 'BEGIN{FS=" ";RS=">"}{if(NR>1){ a[$1]++; h=""; for(i=2; i<NF; i++) { h=h" "$i; } print ">"$1"_"a[$1]h"\n"$NF; }}' input.fa
>CP064824.1_1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6_1 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1_2 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC


For multiline FASTA, you'd need to make modifications, but hopefully this gives you some ideas.
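
One possible modification for multiline input is to leave the default line-based records alone and rewrite only the header lines, so that sequence lines pass through untouched. A minimal sketch, assuming headers start with `>` in the first column and the ID is the first whitespace-delimited token (`number_ids_per_name` is a hypothetical helper name):

```shell
# number_ids_per_name is a hypothetical helper name.
number_ids_per_name() {
  awk '
    /^>/ {
      id = $1                              # e.g. ">CP064824.1"
      rest = substr($0, length(id) + 1)    # description, with its leading space
      $0 = id "_" (++count[id]) rest       # per-ID counter goes right after the ID
    }
    { print }                              # sequence lines are printed unchanged
  ' "$@"
}
```

Usage would be `number_ids_per_name input.fa > output.fa`. Because the counter is kept per ID, a repeated ID gets `_1`, `_2`, and so on, as in the single-line version above.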

6.3 years ago
venu 7.0k

You can do something like the following:

cat file.fa | paste - - | awk '{print $1"_"NR"\n"$2}' > new_file.fa
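
Note that `paste - -` pairs lines two at a time, so this relies on every record being exactly one header line plus one sequence line, and the default whitespace splitting in awk would break a header that carries a description. A hedged variant that splits on the tab inserted by `paste` instead (`number_two_line_fasta` is a hypothetical helper name; the two-line assumption still applies):

```shell
# number_two_line_fasta is a hypothetical helper name.
number_two_line_fasta() {
  # paste joins each header line to the following sequence line with a tab;
  # splitting on that tab keeps a header description in one piece.
  paste - - < "$1" | awk -F '\t' '{ print $1 "_" NR "\n" $2 }'
}
```

As with the command above, the counter ends up at the end of the whole header line, not directly after the ID.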


I don't think it's a question; he is sharing a way of doing just this.


Oops! I think I was in too much of a hurry, as I am busy with our Biostars Handbook.


Best excuse I can imagine, keep up the good work.


If you put this in the Handbook, make sure to use the OP's awk, as it works with all FASTA files and not just ones where the sequence is less than 120 characters! :) (Or don't put awk in at all, because awk is the devil! :P)


Thanks for the new way to do this. ZQ