Comparing many FASTA files
3.6 years ago
L_LANKA • 0

Hello!

I have a question about comparing FASTA files.

Input: I have ~11,000 FASTA files. The files differ in length; for example, some are around 29,000 bp and others around 28,000 bp.

Output: I would like to know which FASTA files are unique.

For example, say we have 7 FASTA files. Files 1, 2, 5, 6 and 7 carry the same SNP, so only file 1 interests me as the representative of that group. Samples 1, 3 and 4 then each have a unique SNP, so the output contains three unique FASTA files.

I was thinking about a bash script, but the files are of different lengths. I hope that you can help me.

Thank you so much!

fasta compare SNP • 851 views

If the only differences between those FASTA files are the SNPs, you could consider calculating the MD5 checksum for each file (or sequence) and comparing those. As soon as two files differ by a single character, their MD5 checksums will differ as well; conversely, files or sequences that are 100% identical will have the same MD5 checksum.
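A minimal Python sketch of that idea, in case it helps. It assumes each file holds a single sequence, and it hashes the sequence itself rather than the whole file, so differing header lines or line wrapping do not matter. The function names `read_sequence` and `group_by_md5` are made up for this example.

```python
import hashlib
from collections import defaultdict


def read_sequence(path):
    """Return the upper-cased sequence of a FASTA file, skipping header lines."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


def group_by_md5(paths):
    """Map each MD5 digest to the list of files whose sequence has that digest."""
    groups = defaultdict(list)
    for path in paths:
        sequence = read_sequence(path)
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        groups[digest].append(str(path))
    return dict(groups)
```

Calling `group_by_md5(sorted(pathlib.Path("fasta_dir").glob("*.fasta")))` (with `fasta_dir` as a placeholder directory) would give one group per distinct sequence; the first file of each group can serve as its representative. Note that this only groups sequences that are identical character for character, so two sequences of different length never end up in the same group.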


but files are of different length

MD5 sums won't work: sequences of different length never produce the same checksum, even if they carry the same SNPs.

You could try CD-HIT.
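CD-HIT expects a single multi-FASTA file as input, so the ~11,000 files would first have to be merged. Here is a rough Python sketch of that preparation step, assuming one sequence per file; the helper names `merge_fasta` and `cdhit_command` are invented for this example. The command uses `cd-hit-est` (the nucleotide variant) with `-i` (input), `-o` (output prefix), `-c` (identity threshold) and `-d 0` (keep the full sequence name in the cluster file).

```python
from pathlib import Path


def merge_fasta(paths, merged_path):
    """Write all sequences into one multi-FASTA; each header is the source file stem."""
    count = 0
    with open(merged_path, "w") as out:
        for path in paths:
            path = Path(path)
            parts = []
            with open(path) as handle:
                for line in handle:
                    line = line.strip()
                    if line and not line.startswith(">"):
                        parts.append(line)
            # Assumption: one sequence per input file.
            out.write(f">{path.stem}\n{''.join(parts)}\n")
            count += 1
    return count


def cdhit_command(merged_path, out_prefix, identity=1.0):
    """Build the cd-hit-est argument list; identity=1.0 asks for 100% identical clusters."""
    return [
        "cd-hit-est",
        "-i", str(merged_path),
        "-o", str(out_prefix),
        "-c", str(identity),
        "-d", "0",
    ]
```

With CD-HIT installed, `subprocess.run(cdhit_command("all.fasta", "clusters"), check=True)` should write the representative sequences to `clusters` and the cluster membership to `clusters.clstr` (both file names are placeholders). As far as I understand the default settings, identity is measured relative to the shorter sequence, so a 28,000 bp sequence that matches a 29,000 bp one over its whole length can still land in the same cluster; it is worth checking that this is what you want.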


Right.

I was under the impression that OP also wanted to group sequences of similar length together ...

CD-HIT is indeed a good approach.


Thank you so much! I will try it with CD-HIT.
