Removing >90% Identical Sequences From Sequence Alignment
10.2 years ago
Pappu ★ 2.1k

Let me know if there is any software available for that.

edit:

Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.

I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.

msa clustering • 5.8k views

Not a real question.


I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.


Please edit your question to give more details about what you are trying to achieve and the biological question behind it; then I might re-open it. Hint: you will need at least one or two paragraphs (5-10 sentences) to make a valid question out of this.


I am not able to edit the question since it is closed. Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.


I have re-opened the question and inserted your comments. However, what do you mean by "I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments."? I think it's supposed to work with FASTA files, just as you would use it when you apply it before the MSA.

Why would you try to apply it after the MSA?

Why would you build an MSA and then remove sequences from it afterwards? That makes the whole multiple alignment invalid.


Unclear. You say that you want to use CD-HIT before multiple sequence alignment. Then you say that CD-HIT does not work with multiple sequence alignment.

CD-HIT should output a FASTA file of non-redundant sequences, suitable for input to an aligner.


I want to use the MSA as input for cd-hit to remove >90% identical sequences.


I think that makes no sense.


I agree. The input to CD-HIT is not an MSA. It's a file of unaligned sequences in FASTA format. I think you want to run the MSA after CD-HIT.
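
If all you have is the aligned FASTA, one way to get back to plain sequences is to strip the gap characters before clustering. This is only a minimal sketch: the function name `degap_fasta` is made up for illustration, and it assumes the alignment is in FASTA format with `-` or `.` as gap characters.

```python
def degap_fasta(aligned_text):
    """Return unaligned FASTA text from aligned FASTA text.

    Assumption: gaps are written as '-' or '.'; everything else is kept.
    """
    records = []
    header, chunks = None, []
    for line in aligned_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Drop gap characters from this piece of sequence.
            chunks.append(line.replace("-", "").replace(".", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    # One header line and one sequence line per record.
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The result can be written to a file and clustered like any other FASTA input; the surviving sequences then need to be re-aligned, since the old alignment columns no longer apply.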

10.2 years ago
Michael 54k

I think the answer is simply: your pipeline should look like:

  • Input: FASTA file A
  • Run CD-HIT or other clustering software on A
  • Output: FASTA file B with the clustered (non-redundant) sequences
  • Run the MSA on B
  • Output: the MSA

Do not try to alter the MSA after it is generated.
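
A rough sketch of the steps above in Python, in case it helps. Assumptions on my side: protein sequences (for nucleotides you would use `cd-hit-est` and a different word size), MAFFT as the aligner (any aligner would do), and both tools installed and on the PATH. The function names and file names are just placeholders.

```python
import subprocess


def cdhit_command(fasta_in, fasta_out, identity=0.9, word_size=5):
    """Build the cd-hit call: -c is the identity threshold, -n the word size.

    Word size 5 is the cd-hit setting for protein thresholds of 0.7 to 1.0.
    """
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", str(word_size)]


def mafft_command(fasta_in):
    """Build the MAFFT call; MAFFT prints the alignment to stdout."""
    return ["mafft", "--auto", fasta_in]


def run_pipeline(fasta_a, fasta_b, msa_out):
    """Cluster A into B with cd-hit, then align B and save the MSA."""
    subprocess.run(cdhit_command(fasta_a, fasta_b), check=True)
    with open(msa_out, "w") as handle:
        subprocess.run(mafft_command(fasta_b), stdout=handle, check=True)
```

Calling `run_pipeline("A.fasta", "B.fasta", "B.aln.fasta")` would then produce the clustered file and the alignment in one go.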

10.2 years ago
Arnaud Ceol ▴ 860

Are you working with nucleotide or protein sequences? In the latter case, the work has already been done by UniProt/UniRef: http://www.uniprot.org/help/uniref
