Question

Removing >90% Identical Sequences From Sequence Alignment

0

10.7 years ago

Pappu ★ 2.1k

Let me know if there is any software available for that.

edit:

Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.

I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.

msa clustering • 6.2k views

ADD COMMENT • link updated 7.4 years ago by Biostar 20 • written 10.7 years ago by Pappu ★ 2.1k

1

Not a real question

ADD REPLY • link 10.7 years ago by Michael 55k

0

I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.

ADD REPLY • link 10.7 years ago by Pappu ★ 2.1k

1

Please edit your question to give more details about what you are trying to achieve and what the biological question behind it is; then I might re-open it. Hint: you will need at least one or two paragraphs (5-10 sentences) to make a valid question out of this.

ADD REPLY • link 10.7 years ago by Michael 55k

0

I am not able to edit the question since it is closed. Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.

ADD REPLY • link 10.7 years ago by Pappu ★ 2.1k

0

I have re-opened the question and inserted your comments. However, what do you mean by "I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments."? I think it's supposed to work with FASTA files, just as you would use it when you apply it before the MSA.

Why would you try to apply it after the MSA?

Why would you build an MSA and then remove sequences from it afterwards? That makes the whole multiple alignment invalid.

ADD REPLY • link 10.7 years ago by Michael 55k

1

Unclear. You say that you want to use CD-HIT before multiple sequence alignment. Then you say that CD-HIT does not work with multiple sequence alignment.

CD-HIT should output a FASTA file of non-redundant sequences, suitable for input to an aligner.

ADD REPLY • link 10.7 years ago by Neilfws 49k

0

I want to use the MSA as input for cd-hit to remove >90% identical sequences.

ADD REPLY • link 10.7 years ago by Pappu ★ 2.1k

1

I think that makes no sense.

ADD REPLY • link 10.7 years ago by Michael 55k

0

I agree. The input to CD-HIT is not an MSA. It's a file of sequences in FASTA format. I think you want to do the MSA after CD-HIT.
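
If all you have is the aligned FASTA, one way to get back to plain sequences for CD-HIT is to strip the gap characters first. A minimal sketch, assuming the alignment is stored as FASTA and that gaps are written as `-` or `.`; the function name `degap_fasta` is made up for illustration:

```python
def degap_fasta(aligned_text, gap_chars="-."):
    """Return FASTA text with gap characters removed from every sequence.

    Assumes FASTA-formatted input; headers (lines starting with '>') are
    kept unchanged and each sequence is written back on a single line.
    """
    records = []
    header, chunks = None, []
    for line in aligned_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            # Drop gap characters, keep residues as they are.
            chunks.append("".join(ch for ch in line if ch not in gap_chars))
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}\n" for h, s in records)


if __name__ == "__main__":
    print(degap_fasta(">s1\nAC-G\n--T\n>s2\nA.CG\n"), end="")
```

The ungapped output can then be clustered and re-aligned from scratch, rather than editing the existing alignment.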

ADD REPLY • link 10.7 years ago by Neilfws 49k

score 1 · Answer 1 · 2014-02-25

1

10.7 years ago

Michael 55k

I think the answer is simply: your pipeline should look like:

Input: Fasta file A
Run CD-HIT or other clustering software on A
:-> Fasta file with clustered sequences B
Run MSA on B
:-> MSA

Do not try to alter the MSA after it is generated.
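
The pipeline above can be sketched as two command lines. This is only an illustration under some assumptions: protein sequences (for nucleotides, `cd-hit-est` would be the usual choice), MAFFT as the aligner (any aligner would do), and hypothetical file names. The `-i`, `-o`, `-c` and `-n` options are CD-HIT's input, output, identity threshold and word size; the helper names are made up here.

```python
import shlex


def cdhit_command(fasta_in, fasta_out, identity=0.9):
    """Build a CD-HIT command that clusters at the given identity threshold.

    Assumes protein input; -n 5 is the word size CD-HIT documents for
    thresholds between 0.7 and 1.0.
    """
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", "5"]


def mafft_command(fasta_in):
    """Build a MAFFT command; MAFFT writes the alignment to stdout."""
    return ["mafft", "--auto", fasta_in]


def pipeline_commands(fasta_a, fasta_b="B.fasta", msa_out="B.aln.fasta",
                      identity=0.9):
    """Return the two shell commands: cluster A into B, then align B."""
    step1 = shlex.join(cdhit_command(fasta_a, fasta_b, identity))
    step2 = shlex.join(mafft_command(fasta_b)) + " > " + shlex.quote(msa_out)
    return [step1, step2]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where CD-HIT and MAFFT are not installed.
    for command in pipeline_commands("A.fasta"):
        print(command)
```

The point of the ordering is that redundancy is removed on the unaligned sequences, so the alignment is built once on the final set and left untouched.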

ADD COMMENT • link 10.7 years ago by Michael 55k

score 1 · Answer 2 · 2014-02-25

1

10.7 years ago

Arnaud Ceol ▴ 860

Are you working with nucleotide or protein sequences? In the latter case, the work has already been done by UniProt/UniRef: http://www.uniprot.org/help/uniref

ADD COMMENT • link 10.7 years ago by Arnaud Ceol ▴ 860