Question

Identify Duplicates In Two Fasta Files But Not Merge Them

1

Entering edit mode

10.0 years ago

gtho123 ▴ 260

I have two relatively large FASTA files of ESTs that have similar sequences in them, but they have different IDs. I wish to cycle through every sequence in file A and find the corresponding sequence in file B if it exists.

However I do not wish to merge the files, merely identify corresponding sequences and extract the respective IDs for each file. This way I could analyse each file separately and then compare them easily.

I realize I could do reciprocal BLASTs and take the top hits that agree to create a table, but not wanting to reinvent the wheel I wish to know whether there is an existing program or script that would do this. Any ideas?

Any help would be greatly appreciated.

EDIT: I should clarify. My two files of ESTs come from different sources. I know one is really a whole lot of ESTs assembled together into "putative transcripts". Therefore I do not feel comfortable assuming that the similar sequences are the same length and start and end in the same points in the sequence. I therefore need a method that can identify the similar sequences given these constraints.

duplicates fasta • 4.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 10.0 years ago by gtho123 ▴ 260

Ram · Answer 1 · 2014-04-12

I just wrote these one-liners for you:

When exact match is required

When inexact match is required

In such case, probably the only solution I can think of is to cluster your sequences together.

alt text

I have a python package that has an embarassingly-parallel (multithreaded + utilises streaming SIMD extensions) implementation for qgram-based edit distance measurement and is useful for hierarchical-clustering of DNA sequences assuming lengths are SOMEWHAT similar

Package: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/CLUSThack_v0.2.tar.gz
Once the package is installed, use this script to cluster sequences:
Test data: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/test.fasta
Test data output: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/test.pdf

You use hclust.py script as follows:

$ ./hclust.py -f test.fasta -t 32
Clust_1 C8_Fusobacteriaceae;C28_Fusobacteriaceae;C106_Fusobacteriaceae;C121_Fusobacteriaceae;C146_Fusobacteriaceae;C180_Fusobacteriaceae
Clust_2 C3_Verrucomicrobiaceae;C47_Verrucomicrobiaceae;C48_Verrucomicrobiaceae;C126_Verrucomicrobiaceae;C134_Verrucomicrobiaceae;C162_Verrucomicrobiaceae
Clust_3 C31_Planctomycetaceae;C137_Planctomycetaceae;C142_Planctomycetaceae
Clust_4 C43_Porphyromonadaceae;C55_Porphyromonadaceae;C159_Porphyromonadaceae;C160_Porphyromonadaceae
Clust_5 C4_Bacteroidaceae;C11_Bacteroidaceae;C51_Bacteroidaceae;C54_Bacteroidaceae;C64_Bacteroidaceae;C85_Bacteroidaceae;C94_Bacteroidaceae;C98_Bacteroidaceae;C103_Bacteroidaceae;C109_Bacteroidaceae;C131_Bacteroidaceae;C132_Bacteroidaceae;C154_Bacteroidaceae;C175_Bacteroidaceae;C185_Bacteroidaceae;C191_Bacteroidaceae
Clust_6 C13_Chlorobiaceae;C22_Chlorobiaceae;C26_Chlorobiaceae;C30_Chlorobiaceae;C46_Chlorobiaceae;C59_Chlorobiaceae;C66_Chlorobiaceae;C70_Chlorobiaceae;C104_Chlorobiaceae;C111_Chlorobiaceae;C135_Chlorobiaceae;C143_Chlorobiaceae;C188_Chlorobiaceae
Clust_7 C124_Spirochaetaceae;C169_Spirochaetaceae

It takes a -c cutoff switch so if you use -c 1 then it will cluster all the sequences with an edit distance of 1. You may adjust -c parameter to fulfill your needs

You can also use Robert Edgar's usearch.

./usearch7.0.1001_i86linux32 -usearch_global f1.fasta -db f2.fasta -strand plus -id 0.97 -uc map.uc

It will cluster sequences from f1.fasta around sequences in f2.fasta based on given identity (0.97) i.e. 97% similarity. You can adjust the parameter -id accordingly. The following clustering will be generated in map.uc file. Check my tutorial (UPARSE section) on how to manipulate this file

Best Wishes, Umer

score 0 · Answer 2 · 2014-04-12

I tried the following perl script,

used How To Remove The Same Sequences In The Fasta Files?'s script according to my needs.

   ####-----Reformatted the fasta file into one line fasta format first---###
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a1.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a1.fasta
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a2.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a2.fasta

SCRIPT

open FH,"re_a1.fasta";
%Hash=();
while(<FH>)
{
    $header1=$_;
    $seq1=<FH>;
    $Hash{$header1}="$seq1";
}    

    open HF,"re_a2.fasta";
    while(<HF>)
    {
        $header2=$_;
        $seq2=<HF>;
        while(($key,$value) = each (%Hash))
        {
            if($value eq $seq2)
        {
        chomp($key);chomp($header2);
        $key =~ s/>//;  ##remove >
        $header2 =~ s/>//; ##remove >
        print "\"$key\" in FILE1 and \"$header2\" in FILE2 are exactly the same\n";
        }
        }
    }
    close(FH);
    close(HF);



 ___FILE1_____
    >1
    ATGCTCGTATATGCGATCGA
    >2
    GATGCTAGTGCATGTATATA
    >3
    TATCTAGCTAGATAGCTAGTA

     ___FILE2_____
    >seq1
   GATGCTAGTGCATGTATATA
   >seq3
   GCTGATCCACAGATCGGCT
   >seq2
   GCTGCATGCTGTATGCGCCGTAGTC

OUTPUT

 "2" in FILE1 and "seq1" in FILE2 are exactly the same