Identify Duplicates In Two Fasta Files But Not Merge Them
2
1
Entering edit mode
8.8 years ago
gtho123 ▴ 230

I have two relatively large FASTA files of ESTs that have similar sequences in them, but they have different IDs. I wish to cycle through every sequence in file A and find the corresponding sequence in file B if it exists.

However I do not wish to merge the files, merely identify corresponding sequences and extract the respective IDs for each file. This way I could analyse each file separately and then compare them easily.

I realize I could do reciprocal BLASTs and take the top hits that agree to create a table, but not wanting to reinvent the wheel I wish to know whether there is an existing program or script that would do this. Any ideas?

Any help would be greatly appreciated.

EDIT: I should clarify. My two files of ESTs come from different sources. I know one is really a whole lot of ESTs assembled together into "putative transcripts". Therefore I do not feel comfortable assuming that the similar sequences are the same length and start and end in the same points in the sequence. I therefore need a method that can identify the similar sequences given these constraints.

duplicates fasta • 3.8k views
2
Entering edit mode
8.8 years ago

I just wrote these one-liners for you:

## When inexact match is required

In such case, probably the only solution I can think of is to cluster your sequences together.

I have a python package that has an embarassingly-parallel (multithreaded + utilises streaming SIMD extensions) implementation for qgram-based edit distance measurement and is useful for hierarchical-clustering of DNA sequences assuming lengths are SOMEWHAT similar

You use hclust.py script as follows:

$./hclust.py -f test.fasta -t 32 Clust_1 C8_Fusobacteriaceae;C28_Fusobacteriaceae;C106_Fusobacteriaceae;C121_Fusobacteriaceae;C146_Fusobacteriaceae;C180_Fusobacteriaceae Clust_2 C3_Verrucomicrobiaceae;C47_Verrucomicrobiaceae;C48_Verrucomicrobiaceae;C126_Verrucomicrobiaceae;C134_Verrucomicrobiaceae;C162_Verrucomicrobiaceae Clust_3 C31_Planctomycetaceae;C137_Planctomycetaceae;C142_Planctomycetaceae Clust_4 C43_Porphyromonadaceae;C55_Porphyromonadaceae;C159_Porphyromonadaceae;C160_Porphyromonadaceae Clust_5 C4_Bacteroidaceae;C11_Bacteroidaceae;C51_Bacteroidaceae;C54_Bacteroidaceae;C64_Bacteroidaceae;C85_Bacteroidaceae;C94_Bacteroidaceae;C98_Bacteroidaceae;C103_Bacteroidaceae;C109_Bacteroidaceae;C131_Bacteroidaceae;C132_Bacteroidaceae;C154_Bacteroidaceae;C175_Bacteroidaceae;C185_Bacteroidaceae;C191_Bacteroidaceae Clust_6 C13_Chlorobiaceae;C22_Chlorobiaceae;C26_Chlorobiaceae;C30_Chlorobiaceae;C46_Chlorobiaceae;C59_Chlorobiaceae;C66_Chlorobiaceae;C70_Chlorobiaceae;C104_Chlorobiaceae;C111_Chlorobiaceae;C135_Chlorobiaceae;C143_Chlorobiaceae;C188_Chlorobiaceae Clust_7 C124_Spirochaetaceae;C169_Spirochaetaceae It takes a -c cutoff switch so if you use -c 1 then it will cluster all the sequences with an edit distance of 1. You may adjust -c parameter to fulfill your needs You can also use Robert Edgar's usearch. ./usearch7.0.1001_i86linux32 -usearch_global f1.fasta -db f2.fasta -strand plus -id 0.97 -uc map.uc It will cluster sequences from f1.fasta around sequences in f2.fasta based on given identity (0.97) i.e. 97% similarity. You can adjust the parameter -id accordingly. The following clustering will be generated in map.uc file. Check my tutorial (UPARSE section) on how to manipulate this file Best Wishes, Umer ADD COMMENT 0 Entering edit mode This seems a very elegant way to identify duplicates. However I can't be certain my two files contain exact duplicates in terms of length. ADD REPLY 0 Entering edit mode I have updated my original post to answer this issue! ADD REPLY 0 Entering edit mode 8.8 years ago Prakki Rama ★ 2.6k I tried the following perl script, used How To Remove The Same Sequences In The Fasta Files?'s script according to my needs. ####-----Reformatted the fasta file into one line fasta format first---### sed -e '/^>/s/$/@/' -e 's/^>/#/' a1.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a1.fasta sed -e '/^>/s/$/@/' -e 's/^>/#/' a2.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a2.fasta SCRIPT open FH,"re_a1.fasta"; %Hash=(); while(<FH>) {$header1=$_;$seq1=<FH>;
$Hash{$header1}="$seq1"; } open HF,"re_a2.fasta"; while(<HF>) {$header2=$_;$seq2=<HF>;
while(($key,$value) = each (%Hash))
{
if($value eq$seq2)
{
chomp($key);chomp($header2);
$key =~ s/>//; ##remove >$header2 =~ s/>//; ##remove >
print "\"$key\" in FILE1 and \"$header2\" in FILE2 are exactly the same\n";
}
}
}
close(FH);
close(HF);

___FILE1_____
>1
ATGCTCGTATATGCGATCGA
>2
GATGCTAGTGCATGTATATA
>3
TATCTAGCTAGATAGCTAGTA

___FILE2_____
>seq1
GATGCTAGTGCATGTATATA
>seq3
GCTGATCCACAGATCGGCT
>seq2
GCTGCATGCTGTATGCGCCGTAGTC

OUTPUT

"2" in FILE1 and "seq1" in FILE2 are exactly the same
0
Entering edit mode

Interesting. While sequences in my two files are similar, I cannot assume that they are the same length. Consequently they may not start and end in the same place. Perhaps I was not clear enough, if so I apologize. However I need something a little more sensitive than the "eq" operator. If I could come up with something that could do this and swap it wit the "eq" operator then this script would be great.

0
Entering edit mode

In the case you do not have sequences of same length and you are not expecting 100% match, then i suggest trying CD-HIT-EST from weizhong lab, where you can vary the parameters such as length of shorter sequence and its identity when matched to longer sequence. And as Umer has suggested Usearch is also helpful.