Question: Remove Fasta Sequences with Duplicate IDs (but with different Descriptions) & Append Different Descriptions
2.9 years ago by
tanz.renner0 wrote:

Hello,

My first post, so I hope I'm posting this in the correct place!

I have ~100k fasta sequences - some with duplicate fasta IDs (they also have identical sequences), but with unique descriptions. I would like to extract unique fasta sequences based on ID (so, remove duplicates, but keep one representative sequence), but also append the description associated with the duplicates.

For example, my fasta file might contain the following 3 sequences:

>Contig1
ATGCGAGTAG

>Contig1 Description1
ATGCGAGTAG

>Contig1 Description2
ATGCGAGTAG

And I'm looking to obtain the following single sequence:

>Contig1 Description1 Description2
ATGCGAGTAG

Thanks for any help :)

rna-seq sequence • 1.5k views

I have been trying to use fasuniq, but it can only concatenate the IDs of duplicated sequences.

written 2.9 years ago by tanz.renner0

While the deduplication part can be achieved by different programs (dedupe.sh from the BBMap suite is one), if you must have the descriptions appended to the deduplicated sequence, that would require a specific solution.

written 2.9 years ago by genomax69k
2.9 years ago by
baxy150 wrote:

Quick solution under Linux/Perl

perl -ne 'if (/^>(\S+)\s+(.*\S)/) { push @{$hash{$1}}, $2 } }{ open(I, "<", "test.fa") or die $!; while (<I>) { if (/^>(\S+)/) { $t = 0; next if $h{$1}++; $t = 1; print join(" ", ">$1", @{$hash{$1} // []}), "\n" } elsif ($t) { print } } close I;' test.fa

where test.fa is your file (note that the file name appears in two places). Also change the code accordingly if the separator is a tab.
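
For anyone who would rather read the logic spelled out, here is a rough Python sketch of the same idea: keep the first record for each ID and append every description seen for that ID to its header. This is not a translation of the one-liner. The function name merge_duplicate_ids is made up for illustration, and the sketch assumes that blank lines can be dropped, that identical descriptions should be listed only once, and that the whole file fits in memory.

```python
def merge_duplicate_ids(lines):
    """Collapse FASTA records sharing an ID and join their descriptions.

    `lines` is any iterable of text lines (an open file works).
    Returns a list of output lines without trailing newlines.
    """
    records = {}       # id -> {"desc": [...], "seq": [...]}, in first-seen order
    current = None     # record the most recent header belongs to
    keep_seq = False   # True only while reading the first record of an ID
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into ID and (optional) description on whitespace.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            desc = parts[1].strip() if len(parts) > 1 else ""
            keep_seq = seq_id not in records
            current = records.setdefault(seq_id, {"desc": [], "seq": []})
            # Assumption: an identical description is only listed once.
            if desc and desc not in current["desc"]:
                current["desc"].append(desc)
        elif line.strip() and current is not None and keep_seq:
            # Sequence lines are kept only for the first occurrence of an ID.
            current["seq"].append(line)
    out = []
    for seq_id, rec in records.items():
        out.append(" ".join([">" + seq_id] + rec["desc"]))
        out.extend(rec["seq"])
    return out
```

To use it on a file, open test.fa, pass the file handle to merge_duplicate_ids, and write the returned lines out joined by newlines. The sequences are assumed to be identical for a given ID, as in the question, so only the first copy is kept and the others are not compared.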


This perfectly did the trick - thank you baxy! Brilliant! Now, I need to go study the code you wrote :)

written 2.9 years ago by tanz.renner0

Any suggestions to work this as a loop for hundreds of files?

written 8 months ago by nmkn0
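
One possible way to handle many files is a small Python driver that walks a directory and applies the same per-file step to each FASTA file. The sketch below is only an illustration: process_directory, its parameters and the "*.fa" pattern are hypothetical names, and `transform` stands for whatever function turns the lines of one file into the list of output lines you want (for example, a deduplication routine).

```python
from pathlib import Path


def process_directory(in_dir, out_dir, transform, pattern="*.fa"):
    """Apply `transform` to every file in `in_dir` matching `pattern`.

    `transform` takes an iterable of lines and returns a list of output
    lines without trailing newlines. Results are written to `out_dir`
    under the same file names. Returns the list of paths written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() only makes the processing order predictable.
    for src in sorted(Path(in_dir).glob(pattern)):
        with src.open() as fh:
            result = transform(fh)
        dest = out_path / src.name
        dest.write_text("".join(line + "\n" for line in result))
        written.append(dest)
    return written
```

Writing to a separate output directory keeps the originals untouched. Pointing out_dir at the input directory would overwrite the input files, so that is best avoided.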