Question: Identify Duplicates In Two Fasta Files But Not Merge Them
1
gravatar for gtho123
4.9 years ago by
gtho123210
New Zealand
gtho123210 wrote:

I have two relatively large FASTA files of ESTs that have similar sequences in them, but they have different IDs. I wish to cycle through every sequence in file A and find the corresponding sequence in file B if it exists.

However I do not wish to merge the files, merely identify corresponding sequences and extract the respective IDs for each file. This way I could analyse each file separately and then compare them easily.

I realize I could do reciprocal BLASTs and take the top hits that agree to create a table, but not wanting to reinvent the wheel I wish to know whether there is an existing program or script that would do this. Any ideas?

Any help would be greatly appreciated.

EDIT: I should clarify. My two files of ESTs come from different sources. I know one is really a whole lot of ESTs assembled together into "putative transcripts". Therefore I do not feel comfortable assuming that the similar sequences are the same length and start and end in the same points in the sequence. I therefore need a method that can identify the similar sequences given these constraints.

fasta duplicates • 2.5k views
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by gtho123210
2
gravatar for umer.zeeshan.ijaz
4.9 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.7k wrote:

I just wrote these one-liners for you:

When exact match is required

<script src="&lt;a href=" 10524556"="">10524556"></script>

When inexact match is required

In such case, probably the only solution I can think of is to cluster your sequences together.

alt text

I have a python package that has an embarassingly-parallel (multithreaded + utilises streaming SIMD extensions) implementation for qgram-based edit distance measurement and is useful for hierarchical-clustering of DNA sequences assuming lengths are SOMEWHAT similar

You use hclust.py script as follows:

$ ./hclust.py -f test.fasta -t 32 Clust_1 C8_Fusobacteriaceae;C28_Fusobacteriaceae;C106_Fusobacteriaceae;C121_Fusobacteriaceae;C146_Fusobacteriaceae;C180_Fusobacteriaceae Clust_2 C3_Verrucomicrobiaceae;C47_Verrucomicrobiaceae;C48_Verrucomicrobiaceae;C126_Verrucomicrobiaceae;C134_Verrucomicrobiaceae;C162_Verrucomicrobiaceae Clust_3 C31_Planctomycetaceae;C137_Planctomycetaceae;C142_Planctomycetaceae Clust_4 C43_Porphyromonadaceae;C55_Porphyromonadaceae;C159_Porphyromonadaceae;C160_Porphyromonadaceae Clust_5 C4_Bacteroidaceae;C11_Bacteroidaceae;C51_Bacteroidaceae;C54_Bacteroidaceae;C64_Bacteroidaceae;C85_Bacteroidaceae;C94_Bacteroidaceae;C98_Bacteroidaceae;C103_Bacteroidaceae;C109_Bacteroidaceae;C131_Bacteroidaceae;C132_Bacteroidaceae;C154_Bacteroidaceae;C175_Bacteroidaceae;C185_Bacteroidaceae;C191_Bacteroidaceae Clust_6 C13_Chlorobiaceae;C22_Chlorobiaceae;C26_Chlorobiaceae;C30_Chlorobiaceae;C46_Chlorobiaceae;C59_Chlorobiaceae;C66_Chlorobiaceae;C70_Chlorobiaceae;C104_Chlorobiaceae;C111_Chlorobiaceae;C135_Chlorobiaceae;C143_Chlorobiaceae;C188_Chlorobiaceae Clust_7 C124_Spirochaetaceae;C169_Spirochaetaceae

It takes a -c cutoff switch so if you use -c 1 then it will cluster all the sequences with an edit distance of 1. You may adjust -c parameter to fulfill your needs

You can also use Robert Edgar's usearch ( http://www.drive5.com/usearch/ ):

./usearch7.0.1001_i86linux32 -usearch_global f1.fasta -db f2.fasta -strand plus -id 0.97 -uc map.uc

It will cluster sequences from f1.fasta around sequences in f2.fasta based on given identity (0.97) i.e. 97% similarity. You can adjust the parameter -id accordingly. The following clustering will be generated in map.uc file. Check my tutorial (UPARSE section) http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Illumina_workflow.html on how to manipulate this file

Best Wishes, Umer

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by umer.zeeshan.ijaz1.7k

This seems a very elegant way to identify duplicates. However I can't be certain my two files contain exact duplicates in terms of length.

ADD REPLYlink written 4.9 years ago by gtho123210

I have updated my original post to answer this issue!

ADD REPLYlink written 4.9 years ago by umer.zeeshan.ijaz1.7k
0
gravatar for Prakki Rama
4.9 years ago by
Prakki Rama2.2k
Singapore
Prakki Rama2.2k wrote:

I tried the following perl script,

used How To Remove The Same Sequences In The Fasta Files?'s script according to my needs.

   ####-----Reformatted the fasta file into one line fasta format first---###
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a1.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a1.fasta
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a2.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a2.fasta

SCRIPT

open FH,"re_a1.fasta";
%Hash=();
while(<FH>)
{
    $header1=$_;
    $seq1=<FH>;
    $Hash{$header1}="$seq1";
}    

    open HF,"re_a2.fasta";
    while(<HF>)
    {
        $header2=$_;
        $seq2=<HF>;
        while(($key,$value) = each (%Hash))
        {
            if($value eq $seq2)
        {
        chomp($key);chomp($header2);
        $key =~ s/>//;  ##remove >
        $header2 =~ s/>//; ##remove >
        print "\"$key\" in FILE1 and \"$header2\" in FILE2 are exactly the same\n";
        }
        }
    }
    close(FH);
    close(HF);



 ___FILE1_____
    >1
    ATGCTCGTATATGCGATCGA
    >2
    GATGCTAGTGCATGTATATA
    >3
    TATCTAGCTAGATAGCTAGTA

     ___FILE2_____
    >seq1
   GATGCTAGTGCATGTATATA
   >seq3
   GCTGATCCACAGATCGGCT
   >seq2
   GCTGCATGCTGTATGCGCCGTAGTC

OUTPUT

 "2" in FILE1 and "seq1" in FILE2 are exactly the same
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Prakki Rama2.2k

Interesting. While sequences in my two files are similar, I cannot assume that they are the same length. Consequently they may not start and end in the same place. Perhaps I was not clear enough, if so I apologize. However I need something a little more sensitive than the "eq" operator. If I could come up with something that could do this and swap it wit the "eq" operator then this script would be great.

ADD REPLYlink written 4.9 years ago by gtho123210

In the case you do not have sequences of same length and you are not expecting 100% match, then i suggest trying CD-HIT-EST from weizhong lab, where you can vary the parameters such as length of shorter sequence and its identity when matched to longer sequence. And as Umer has suggested Usearch is also helpful.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Prakki Rama2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1817 users visited in the last hour