Question: Compairng multiple sequence and finding similarity.
0
gravatar for EVR
3.6 years ago by
EVR570
Earth
EVR570 wrote:

HI,

I have two fasta files with same seqnames but with slightly different sequence names like follows:

File_1
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGATGA
    >trinity5_comp_645_0_2
    TCGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC
FIle2
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGCCGA
    >trinity5_comp_645_0_2
    TAGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC

I would like compare each sequence of File1 with respect to corresponding sequences in File2 and compute its percentage of similarity like follows:

trinity5_comp_3_0_1 100%
trinity5_comp_9735_1_5 80%
trinity5_comp_645_0_2 90%
trinity5_comp_3169_0_1 100%

I tried using cd-hit-est-2d but sequences are also compared with other sequences rather than its own corresponding sequences in file2. Kindly guide me.

Thanks in advance

ADD COMMENTlink modified 3.6 years ago by geek_y10k • written 3.6 years ago by EVR570

You can blast your sequences, with the flag -max_target_seqs = 1, and also filter by % identity or set an e-value threshold.

ADD REPLYlink written 3.6 years ago by st.ph.n2.5k

But it doesn’t guarantee you that it will blast against its corresponding sequence with the same header

ADD REPLYlink written 3.6 years ago by EVR570

True. He is missing the gene cluster ids for each isoform in the assembly, however. @Tom, why do you have two trinity fasta? Are you comparing two different assemblies?

ADD REPLYlink written 3.6 years ago by st.ph.n2.5k

These two fasta files are generated using different approaches and I want to compare how different are they by comparing their percentage similarity

ADD REPLYlink written 3.6 years ago by EVR570

Do they follow same order always ?

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by geek_y10k

Yes they do follow same order

ADD REPLYlink written 3.6 years ago by EVR570

How about splitting the two files into individual sequences and then compare pair wise manner using cd-hit-est-2d ?

ADD REPLYlink written 3.6 years ago by geek_y10k

Hi, I thought of doing that, but i have around 3000 sequences which could be tiresome. But anyhow I will try that method too

ADD REPLYlink written 3.6 years ago by EVR570

Its not tiresome, I updated my answer.

ADD REPLYlink written 3.6 years ago by geek_y10k
1
gravatar for geek_y
3.6 years ago by
geek_y10k
Barcelona
geek_y10k wrote:

I do not know what is the scale of your data, but something like calculating edit-distance might help you.

from Bio import SeqIO
import editdistance,sys

fa1 = SeqIO.parse("1.fa", "fasta")
fa2 = SeqIO.parse("2.fa", "fasta")

for rec1,rec2 in zip(fa1,fa2):

    if ( rec1.id != rec2.id ):
        print "The sequences are not in order. Exiting"
        sys.exit()

    length=len(rec1.seq)
    dist=editdistance.eval(rec1.seq,rec2.seq)
    print rec1.id,100-((dist*100)/length)

output:

trinity5_comp_3_0_1 100
trinity5_comp_9735_1_5 80
trinity5_comp_645_0_2 90
trinity5_comp_3169_0_1 100

Assumes the sequences are in same order always. If your fasta sequences are very long, this is definitely not a good solution.

If you want to run cd-hit-est-2d on each pair of sequences:

Make a backup of your files.

cp 1.fa 1.fa.bkp
cp 2.fa 2.fa.bkp

Add a prefix to your fasta header. Check how it works for your system.

sed -i '' -e '/^>/s/$/.1/g' 1.fa
sed -i '' -e '/^>/s/$/.2/g' 2.fa

Then split the fasta file into individual sequences.

for fa in {1,2}.fa
do
curl https://raw.githubusercontent.com/gouthamatla/fasta_File_Manipulation/master/SplitFastaFile.py | python - ${fa}
done

Now you have all your sequences split with .1.fasta and .2.fasta suffix.

Then run cd-hit-est-2d in pairwise manner:

for sample in *.1.fasta 
do
base=`basename  ".1.fasta"`
echo "cd-hit-command $base.1.fasta $base.2.fasta" | parallel --jobs 4
done
rm *.1.fasta *.2.fasta

change the --jobs according to available cores.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by geek_y10k

I tried your python command, it worked. but as I am not that familiar with python, I wanted to write it to file, so I used file.write() but it throwed me error saying that TypeError: write() takes exactly 1 argument (2 given). How can make the python script to write to file. If possible a bound python script which takes two fasta file as input and write it into a file.

ADD REPLYlink written 3.6 years ago by EVR570

Python program definitely works but you should know how it is doing. To write to a file,

python script.py > out.txt
ADD REPLYlink written 3.6 years ago by geek_y10k

I already figured out. I tried to delete that comment but I couldnt do it. thanks for your guidance.

ADD REPLYlink written 3.6 years ago by EVR570

Please validate the results with couple of sequences. Take few random pairs of sequences are run cd-hit on them to compare the outputs. It's very important.

ADD REPLYlink written 3.6 years ago by geek_y10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1045 users visited in the last hour