Question: How to add a suffix to duplicated entry names in a FASTA file
horsedog wrote (2.1 years ago):

I have a bunch of genome sequences in a single file named sequence.fasta, but some of them have exactly the same name, like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Is there a way to detect the sequences that share a name and add a suffix automatically, so that I can tell them apart? This is what I want:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2.1
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2.2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Sequences that already have unique names should be left unchanged.

Thanks a lot!


To clarify, there is a ">" before each name, so a header line looks like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2

written 2.1 years ago by horsedog

http://bioinf.shenwei.me/seqkit/usage/#rename

seqkit rename seqs.fa > new.fa
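
Whichever tool does the renaming, it can help to list the headers that occur more than once, both before and after. The following is a minimal sketch using standard Unix tools; the function name `list_duplicate_headers` is made up for illustration and is not part of seqkit. It compares whole header lines, so two headers that share an ID but differ in their description are not reported.

```shell
# Hypothetical helper: print each FASTA header line that occurs more than once.
list_duplicate_headers() {
  # keep header lines only, sort them, and print one copy of each repeated line
  grep '^>' "$1" | sort | uniq -d
}
```

Running `list_duplicate_headers new.fa` on a file whose headers are all distinct should print nothing.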
written 2.1 years ago by shenwei356
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (2.1 years ago):

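Linearize the records, sort them by header, and number the repeated names (the output is sorted by header, and every name gets a suffix, including the unique ones):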

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | sort -t $'\t' -k1,1 | awk -F '\t' 'BEGIN{N=0;prev="";}{if(prev==$1) { N++;} else {N=1;} printf("%s.%d\n%s\n",$1,N,$2);prev=$1;}'


>1_anotherUniqueGeneName.1
atgc
>1_duplicateName.1
atgc
>1_duplicateName.2
atgc
>1_uniqueGeneName.1
atgc
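
For the case where unique names should stay untouched and the original record order should be kept, a two-pass awk script is one option. This is only a sketch and not part of the original answer: the function name `suffix_duplicates` is made up, the input must be a regular file because it is read twice, and headers are compared as whole lines (an assumption that may not fit files where the same ID carries different descriptions).

```shell
# Hypothetical helper: add .1, .2, ... only to headers that occur more than once.
# The file is passed to awk twice: pass 1 counts headers, pass 2 rewrites them.
suffix_duplicates() {
  awk '
    # pass 1 (first copy of the file): count how often each header line occurs
    NR == FNR { if (/^>/) total[$0]++; next }
    # pass 2: a header seen more than once gets a running number appended
    /^>/ && total[$0] > 1 { n = ++seen[$0]; print $0 "." n; next }
    # unique headers and sequence lines are printed unchanged
    { print }
  ' "$1" "$1"
}
```

Usage would be `suffix_duplicates sequence.fasta > renamed.fasta`. A new name could still collide with a header that already ends in the same suffix, so checking the result for duplicates afterwards is a reasonable precaution.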

Thank you very much!

written 2.1 years ago by horsedog
Brian Bushnell (Walnut Creek, USA) wrote (2.1 years ago):

With BBMap's reformat.sh:

reformat.sh in=file.fa out=fixed.fa uniquenames

That appends "_2", "_3", etc. to the second, third, and later instances of a name. The first occurrence of a name is left unchanged.
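
For readers without BBMap installed, the naming scheme described here (first occurrence unchanged, later ones numbered from 2) can be imitated in a single awk pass. This is a rough sketch of the scheme only, not BBMap's implementation; the function name `uniquify_names` is made up, and whole header lines are compared, which is an assumption.

```shell
# Hypothetical helper: leave the first occurrence of a header alone and
# append _2, _3, ... to later occurrences. Reads a file argument or stdin.
uniquify_names() {
  awk '
    /^>/ {
      n = ++count[$0]              # how many times this header has been seen so far
      if (n > 1) $0 = $0 "_" n     # rename only the second and later occurrences
    }
    { print }                      # print every line, renamed or not
  ' "$@"
}
```

Usage would be `uniquify_names file.fa > fixed.fa`. Record order is preserved because nothing is sorted.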
