Compare many fasta-files

0

3.6 years ago

L_LANKA • 0

Hello!

I have a question about comparing FASTA files.

Input: I have ~11,000 FASTA files. The files are of different lengths; for example, some are around 29,000 bp and others around 28,000 bp.

Output: I would like to know which FASTA files are unique.

For example: we have 7 FASTA files. Files 1, 2, 5, 6 and 7 have the same SNPs, so from that group only file 1 interests me. Files 3 and 4 each have unique SNPs. So the output contains three unique FASTA files: 1, 3 and 4.

I was thinking about a bash script, but the files are of different lengths. I hope that you can help me.

Thank you so much!

fasta compare SNP • 851 views

0

If the only differences between those FASTA files are the SNPs, you could consider calculating the MD5 checksum for each file (or sequence) and comparing those. As soon as there is a single character difference between two files, their MD5 checksums will differ as well; conversely, if files or sequences are 100% identical, they will have the same MD5 checksum.
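A minimal sketch of that idea in Python, using only the standard library. It assumes each file holds a single sequence, and it hashes the sequence only (headers, line wrapping and letter case are ignored) rather than the raw file bytes. The function names `sequence_md5` and `group_identical` are made up for this illustration. Note that this only groups sequences that are exactly identical, so two files that differ in length will always end up in different groups.

```python
import hashlib
from collections import defaultdict


def sequence_md5(fasta_path):
    """Return the MD5 hex digest of the sequence residues in a FASTA file.

    Header lines (starting with '>') and line breaks are skipped, and the
    sequence is upper-cased, so only the residues themselves are compared.
    Assumption: one sequence per file; multiple records would be concatenated.
    """
    digest = hashlib.md5()
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                digest.update(line.upper().encode("ascii"))
    return digest.hexdigest()


def group_identical(fasta_paths):
    """Map each MD5 digest to the sorted list of files sharing that sequence."""
    groups = defaultdict(list)
    for path in fasta_paths:
        groups[sequence_md5(path)].append(str(path))
    return {key: sorted(paths) for key, paths in groups.items()}
```

Calling `group_identical` on the list of all file paths (for example from `pathlib.Path(".").glob("*.fasta")`, if the files sit in one directory with that extension) gives one entry per distinct sequence; taking the first file of each entry yields one representative per group.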

3.6 years ago by lieven.sterck 15k

1

but files are of different length

md5 sums won't work.

You could try CD-HIT.

3.6 years ago by GenoMax 141k

0

Right.

I was under the impression that the OP also wanted to group sequences of similar length together ...