Question: Compare many fasta-files
L_LANKA wrote (11 weeks ago):

Hello!

I have a question about comparing FASTA files.

Input: I have ~11,000 FASTA files. The files differ in length: for example, some are around 29,000 bp and others around 28,000 bp.

Output: I would like to know which FASTA files are unique.

For example: suppose we have 7 FASTA files. Files 1, 2, 5, 6 and 7 have the same SNP, so only file 1 from that group interests me. Files 3 and 4 each have a unique SNP. The output would then be three unique FASTA files: 1, 3 and 4.

I was thinking about a bash script, but the files are of different lengths. I hope that you can help me.

Thank you so much!

snp compare fasta • 148 views

If the only differences between those FASTA files are the SNPs, you could compute an MD5 checksum for each file (or each sequence) and compare those. A single character difference between two files changes the checksum, while files or sequences that are 100% identical always share the same checksum.
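
A minimal sketch of the checksum idea, assuming each file holds a single sequence. The function names `sequence_md5` and `group_identical` are made up for illustration. Headers are skipped and line wrapping and letter case are normalised, so that only the residues are compared; two sequences of different length will never share a checksum with this approach.

```python
import hashlib
from collections import defaultdict


def sequence_md5(fasta_path):
    """Return the MD5 hex digest of the sequence stored in a FASTA file.

    Header lines (starting with '>') are skipped; line breaks are removed
    and letters are upper-cased, so only the residues are hashed.
    """
    digest = hashlib.md5()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            digest.update(line.strip().upper().encode("ascii"))
    return digest.hexdigest()


def group_identical(fasta_paths):
    """Map each checksum to the list of files whose sequence produces it."""
    groups = defaultdict(list)
    for path in fasta_paths:
        groups[sequence_md5(path)].append(str(path))
    return dict(groups)
```

Each key of the returned dictionary stands for one distinct sequence, and the first file in each list can serve as the representative of its group.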

written 11 weeks ago by lieven.sterck

"but files are of different length"

In that case md5 sums won't work.

You could try CD-HIT.
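
A rough sketch of how that could be set up, assuming the `cd-hit-est` program (the nucleotide version of CD-HIT) is installed and that sequence headers are unique across files. The helper names and file paths are hypothetical. The files are first merged into one multi-FASTA file, then clustered with `-c` set to the identity threshold. How sequences of different length are compared depends on CD-HIT's alignment options, so check the user guide before relying on the clusters.

```python
import shutil
import subprocess


def merge_fasta(fasta_paths, merged_path):
    """Concatenate many FASTA files into one multi-FASTA file."""
    with open(merged_path, "w") as out:
        for path in fasta_paths:
            with open(path) as handle:
                text = handle.read()
            # Make sure each file ends with a newline before the next header.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)


def cd_hit_est_command(merged_path, output_path, identity=1.0):
    """Build the cd-hit-est argument list (does not run anything)."""
    return [
        "cd-hit-est",
        "-i", str(merged_path),  # input multi-FASTA file
        "-o", str(output_path),  # representative sequences; clusters in <output>.clstr
        "-c", str(identity),     # sequence identity threshold
        "-d", "0",               # report headers up to the first space in the .clstr file
    ]


def run_cd_hit_est(merged_path, output_path, identity=1.0):
    """Run cd-hit-est, raising an error if it is not installed."""
    if shutil.which("cd-hit-est") is None:
        raise RuntimeError("cd-hit-est not found on PATH")
    subprocess.run(cd_hit_est_command(merged_path, output_path, identity), check=True)
```

The output FASTA file then contains one representative per cluster, and the accompanying `.clstr` file lists which sequences were grouped together.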

written 11 weeks ago by genomax

Right.

I was under the impression that OP also wanted to group sequences of similar length together ...

CD-HIT is indeed a good approach.

written 11 weeks ago by lieven.sterck

Thank you so much! I will try it with CD-HIT.

written 11 weeks ago by L_LANKA