Question: How to add a suffix to duplicate entry names in a FASTA file
19 months ago, horsedog wrote:

I have a bunch of genome sequences in a single file named sequence.fasta, but some of them have exactly the same name, like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Is there a way to detect sequences that share a name and add a suffix automatically, so that I can tell them apart? This is what I want:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2.1
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2.2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Sequences that already have unique names should be left unchanged.

Thanks a lot!

written 19 months ago by horsedog • modified 19 months ago by Brian Bushnell

Before each name there is a ">", so a header looks like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2

written 19 months ago by horsedog

http://bioinf.shenwei.me/seqkit/usage/#rename

seqkit rename seqs.fa > new.fa
written 19 months ago by shenwei356
19 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

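Linearize the FASTA, sort by name, then number each occurrence of a name (note that unique names also get a ".1" suffix and the records come out sorted):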

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | sort -t $'\t' -k1,1 | awk -F '\t' 'BEGIN{N=0;prev="";}{if(prev==$1) { N++;} else {N=1;} printf("%s.%d\n%s\n",$1,N,$2);prev=$1;}'

Example output:


>1_anotherUniqueGeneName.1
atgc
>1_duplicateName.1
atgc
>1_duplicateName.2
atgc
>1_uniqueGeneName.1
atgc
written 19 months ago by Pierre Lindenbaum

Thank you very much!

written 19 months ago by horsedog
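
The awk pipeline above numbers every name, unique ones included, and returns the records sorted by name. If you want exactly the output asked for in the question (suffixes only on repeated names, original order kept), a two-pass awk is one option. This is a minimal sketch, not a tested tool: the function name suffix_duplicates is made up for illustration, the whole header line is treated as the name (so the suffix lands at the end of the line, after any description), and the input must be a regular file because it is read twice.

```shell
# Hypothetical helper: append .1, .2, ... only to header lines that occur more than once.
# Assumption: the entire header line is the sequence name.
suffix_duplicates() {
    awk '
        # Pass 1 (first read of the file): count how often each header occurs.
        NR == FNR { if (/^>/) total[$0]++; next }
        # Pass 2: number each occurrence of a repeated header, in file order.
        /^>/ && total[$0] > 1 { n = ++seen[$0]; $0 = $0 "." n }
        # Print every line (headers and sequence lines) in the original order.
        { print }
    ' "$1" "$1"
}
```

Usage, with renamed.fasta as a placeholder output name:

suffix_duplicates sequence.fasta > renamed.fasta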
19 months ago, Brian Bushnell (Walnut Creek, USA) wrote:

With BBMap's reformat.sh:

reformat.sh in=file.fa out=fixed.fa uniquenames

That appends "_2", "_3", etc. to the second, third, and later instances of a name. The first occurrence of a name is left unchanged.
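
For readers without BBMap installed, the naming scheme described above can be imitated in a single awk pass. This is only a sketch of the scheme, not how reformat.sh works internally: the function name uniquify_names is hypothetical, and the whole header line is assumed to be the name. It does not check whether a generated name such as "x_2" already exists in the file.

```shell
# Hypothetical helper imitating the "_2", "_3" scheme described above.
# Assumption: the entire header line is the sequence name.
uniquify_names() {
    awk '
        # Count each header; from the second occurrence on, append _<count>.
        /^>/ { n = ++seen[$0]; if (n > 1) $0 = $0 "_" n }
        # Print every line, changed or not, in the original order.
        { print }
    ' "$@"
}
```

Usage: uniquify_names file.fa > fixed.fa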

written 19 months ago by Brian Bushnell