Question

Remove Fasta Sequences with Duplicate IDs (but with different Descriptions) & Append Different Descriptions

0

7.6 years ago

tanz.renner • 0

Hello,

My first post, so I hope I'm posting this in the correct place!

I have ~100k FASTA sequences, some with duplicate FASTA IDs (they also have identical sequences) but with unique descriptions. I would like to extract unique sequences based on ID (so, remove duplicates but keep one representative sequence) and also append the descriptions associated with the duplicates.

For example, my fasta file might contain the following 3 sequences:

>Contig1
ATGCGAGTAG

>Contig1 Description1
ATGCGAGTAG

>Contig1 Description2
ATGCGAGTAG

And I'm looking to obtain the following single sequence:

>Contig1 Description1 Description2
ATGCGAGTAG

Thanks for any help :)

RNA-Seq sequence • 3.1k views

updated 7.6 years ago by baxy ▴ 170 • written 7.6 years ago by tanz.renner • 0

0

I have been trying to use fasuniq, but it can only concatenate the IDs of duplicated sequences.

7.6 years ago by tanz.renner • 0

0

While the deduplication part can be achieved by different programs (dedupe.sh from the BBMap suite is one), if you must have the descriptions appended to the deduped sequence then that would require a specific solution.

7.6 years ago by GenoMax 141k

score 5 · Answer 1 · 2016-09-06

5

7.6 years ago

baxy ▴ 170

Quick solution under Linux/Perl

perl -ne 'if (/>(.*?)\s+(.*)/){push(@{$hash{$1}},$2) if length $2;}}{open(I, "<","test.fa");while(<I>){if(/>(.*?)\s+/){ $t = 0; next if $h{$1}; $h{$1} = 1; $t = 1; print join(" ", ">$1", @{$hash{$1} || []}) . "\n"}elsif($t==1){print $_} } close I;' test.fa

Here test.fa is your file (note that the file name appears in two places). Also, change the code accordingly if the separator is a tab.
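
For anyone who would rather study the logic than decode a one-liner, here is a rough sketch of the same idea written out in Python instead of Perl. It is not baxy's code. The function name merge_duplicate_ids is made up for illustration, and skipping repeated descriptions is my own choice. It assumes, as in the question, that records sharing an ID also share a sequence, so only the first sequence is kept. It also holds everything in memory.

```python
def merge_duplicate_ids(lines):
    """Collapse FASTA records that share an ID, appending their descriptions.

    Assumption: duplicates have identical sequences, so the sequence of the
    first record seen for each ID is the one that is kept.
    """
    descriptions = {}  # ID -> unique descriptions, in order of appearance
    sequences = {}     # ID -> sequence lines of the first record with that ID
    current = None     # ID whose sequence lines are being collected, if any

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header on the first run of whitespace (space or tab).
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            desc = parts[1].strip() if len(parts) > 1 else ""

            first_time = seq_id not in descriptions
            seen = descriptions.setdefault(seq_id, [])
            if desc and desc not in seen:
                seen.append(desc)

            if first_time:
                sequences[seq_id] = []
                current = seq_id
            else:
                current = None  # duplicate record: ignore its sequence lines
        elif line.strip() and current is not None:
            sequences[current].append(line)

    merged = []
    for seq_id, descs in descriptions.items():
        merged.append(" ".join([">" + seq_id] + descs))
        merged.extend(sequences[seq_id])
    return merged


# The three records from the question.
example = """>Contig1
ATGCGAGTAG

>Contig1 Description1
ATGCGAGTAG

>Contig1 Description2
ATGCGAGTAG
"""

for out_line in merge_duplicate_ids(example.splitlines()):
    print(out_line)
# prints:
# >Contig1 Description1 Description2
# ATGCGAGTAG
```

For a real file, passing an open file handle (for example open("test.fa")) in place of example.splitlines() should work the same way, since the function only iterates over lines.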

0

This perfectly did the trick - thank you baxy! Brilliant! Now, I need to go study the code you wrote :)