3.7 years ago
L_LANKA
Hello!
I have a question about comparing FASTA files.
Input: I have ~11,000 FASTA files. The files differ in length; for example, some are around 29,000 bp and others around 28,000 bp.
Output: I would like to know which FASTA files are unique.
For example: we have 7 FASTA files. Files 1, 2, 5, 6 and 7 have the same SNPs, so only file 1 interests me as their representative. Files 3 and 4 each have unique SNPs. The output would then contain three unique FASTA files: 1, 3 and 4.
I was thinking about a bash script, but the files are of different lengths. I hope you can help me.
Thank you so much!
If the only differences between those FASTA files are the SNPs, you could calculate the MD5 checksum of each file (or sequence) and compare those. A single-character difference between two files changes the checksum, whereas files or sequences that are 100% identical share the same checksum.
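A minimal Python sketch of the checksum idea, assuming each file holds a single FASTA record and that headers, line wrapping and letter case should be ignored. The function names and the `records` mapping are made up for illustration. Note that this only detects exact identity: sequences of different lengths never share a checksum.

```python
import hashlib
from collections import defaultdict


def sequence_md5(fasta_text):
    """Return the MD5 hex digest of the sequence in a single-record FASTA string.

    Header lines (starting with '>') are skipped, line breaks are removed and
    the sequence is upper-cased, so only the residues themselves are hashed.
    """
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )
    return hashlib.md5(residues.upper().encode("ascii")).hexdigest()


def group_identical(records):
    """Map each digest to the sorted names of the records that hash to it.

    `records` is a dict of {name: FASTA text} (hypothetical input layout).
    """
    groups = defaultdict(list)
    for name, text in records.items():
        groups[sequence_md5(text)].append(name)
    return {digest: sorted(names) for digest, names in groups.items()}
```

To run it on real files, one could build `records` with something like `{p.name: p.read_text() for p in pathlib.Path("fastas").glob("*.fasta")}`, where the `fastas` directory is a placeholder. Each group with more than one name is a set of exact duplicates, and keeping the first name of every group gives one representative per unique sequence.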
md5 sums won't work.
You could try CD-HIT.
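CD-HIT clusters the sequences in one input file, so the ~11,000 separate files would first need to be combined into a single multi-FASTA. A rough Python sketch of that step follows, assuming one record per file; `merge_fasta` and the `records` mapping are made-up names, and each header is replaced by the file name so that clusters can be traced back to their files.

```python
def merge_fasta(records):
    """Build one multi-FASTA string from a dict of {file name: FASTA text}.

    Assumes one record per file. The original header is dropped and the file
    name is used instead; the sequence is written on a single line.
    """
    parts = []
    for name, text in sorted(records.items()):
        residues = "".join(
            line.strip()
            for line in text.splitlines()
            if line.strip() and not line.startswith(">")
        )
        parts.append(f">{name}\n{residues}\n")
    return "".join(parts)
```

The combined file could then be passed to the nucleotide version of the tool, for example `cd-hit-est -i combined.fasta -o representatives.fasta -c 1.0`, where the file names are placeholders and `-c` is the sequence identity threshold. Other options, such as word size and length-difference cutoffs, are best taken from the CD-HIT user guide.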
right.
I was under the impression OP also wanted to get similar length stuff together ...
CD-HIT is a good approach indeed.
Thank you so much! I will try it with CD-HIT.