Question: Removing >90% Identical Sequences From Sequence Alignment
4.0 years ago, Pappu (1.8k) wrote:

Please let me know if there is any software available for that.

edit:

Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.

I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.

msa clustering • 2.5k views
modified 8 months ago by Biostar • written 4.0 years ago by Pappu (1.8k)

Not a real question

written 4.0 years ago by Michael Dondrup (43k)

I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments. So I wanted to know if there is software available for that.

written 4.0 years ago by Pappu (1.8k)

Please edit your question to give more details about what you are trying to achieve and what the biological question behind it is; then I might re-open it. Hint: you will need at least one or two paragraphs (5-10 sentences) to make a valid question out of this.

modified 4.0 years ago • written 4.0 years ago by Michael Dondrup (43k)

I am not able to edit the question since it is closed. Basically, I want to reduce the number of sequences prior to phylogenetic tree construction. The first step is to remove >90% identical sequences with cd-hit before multiple sequence alignment. Then I want to remove the sequences that are >90% identical in the alignment. I am looking for a program for that purpose.
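To make the request concrete, here is a minimal Python sketch of what "removing sequences that are >90% identical in the alignment" could mean. It is not an existing tool from this thread. The function names are hypothetical, and the identity definition (matches divided by the number of columns where neither sequence has a gap) is an assumption; CD-HIT and other programs define identity differently. Rows are kept greedily in input order.

```python
def parse_fasta(text):
    """Parse (aligned) FASTA text into a dict of name -> sequence, in input order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def aligned_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where neither row has a gap.

    This definition is an assumption made for the sketch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x in gaps or y in gaps:
            continue
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    return matches / compared if compared else 0.0


def filter_redundant(alignment, threshold=0.9):
    """Keep a row only if it is not more than `threshold` identical to any kept row."""
    kept = {}
    for name, seq in alignment.items():
        if all(aligned_identity(seq, other) <= threshold for other in kept.values()):
            kept[name] = seq
    return kept


ALIGNMENT = """>s1
ACDEFGHIKL
>s2
ACDEFGHIKL
>s3
AC-EFGHIKM
>s4
WWDEYGHQ-L
"""

if __name__ == "__main__":
    for name in filter_redundant(parse_fasta(ALIGNMENT)):
        print(name)
```

Note that dropping rows this way can leave gap-only columns behind, so the reduced set would normally be re-aligned before building a tree.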

written 4.0 years ago by Pappu (1.8k)

I have re-opened the question and inserted your comments. However, what do you mean by "I use cd-hit to remove sequences by identity. However it does not work with multiple sequence alignments."? I think it is supposed to work with FASTA files, just as you would use it when you apply it before the MSA.

Why would you try to apply it after the MSA?

Why would you build an MSA and then remove sequences from it afterwards? That makes the whole multiple alignment invalid.

modified 4.0 years ago • written 4.0 years ago by Michael Dondrup (43k)

Unclear. You say that you want to use CD-HIT before multiple sequence alignment. Then you say that CD-HIT does not work with multiple sequence alignment.

CD-HIT should output a FASTA file of non-redundant sequences, suitable for input to an aligner.

written 4.0 years ago by Neilfws (47k)

I want to use the MSA as input for cd-hit to remove >90% identical sequences.

written 4.0 years ago by Pappu (1.8k)

I think that makes no sense.

written 4.0 years ago by Michael Dondrup (43k)

I agree. The input to CD-HIT is not an MSA. It's a file of sequences in FASTA format. I think you want to do the MSA after CD-HIT.
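If the only file at hand is an aligned FASTA, one possible way to get back to plain sequences is to strip the gap characters. This is a rough Python sketch; the function name `ungap_fasta` is hypothetical, and it assumes gaps are written as `-` or `.` and that the alignment is in FASTA format.

```python
def ungap_fasta(text, gap_chars="-."):
    """Return FASTA text with gap characters removed from every sequence line."""
    # Translation table that deletes all gap characters.
    table = str.maketrans("", "", gap_chars)
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header lines are kept unchanged
        else:
            stripped = line.translate(table).strip()
            if stripped:  # skip lines that contained only gaps
                out.append(stripped)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    aligned = ">a\nAC-DE\n--\n>b\nA.CD\n"
    print(ungap_fasta(aligned), end="")
```

The resulting file contains ordinary unaligned sequences, which is the kind of input a clustering step expects.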

written 4.0 years ago by Neilfws (47k)
4.0 years ago, Michael Dondrup (43k), Bergen, Norway, wrote:

I think the answer is simply that your pipeline should look like this:

  • Input: FASTA file A
  • Run CD-HIT or other clustering software on A
  • Output: FASTA file B with the clustered (non-redundant) sequences
  • Run MSA on B
  • Output: MSA

Do not try to alter the MSA after it is generated.
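The steps above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes protein sequences, that the `cd-hit` and `mafft` executables are installed and on the PATH, and MAFFT is picked here purely as an example aligner (the thread does not name one). The helper names are hypothetical. The word size `-n 5` is the value CD-HIT documents for protein identity thresholds between 0.7 and 1.0; nucleotide data would need `cd-hit-est` and different settings.

```python
import subprocess


def cdhit_command(fasta_in, fasta_out, identity=0.9):
    """Build a cd-hit command line: -i input, -o output, -c identity, -n word size."""
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", "5"]


def mafft_command(fasta_in):
    """Build a MAFFT command line; MAFFT writes the alignment to stdout."""
    return ["mafft", "--auto", fasta_in]


def run_pipeline(fasta_a, fasta_b, msa_out, identity=0.9):
    """Cluster A into B with CD-HIT, then align B and write the MSA to msa_out."""
    subprocess.run(cdhit_command(fasta_a, fasta_b, identity), check=True)
    with open(msa_out, "w") as handle:
        subprocess.run(mafft_command(fasta_b), stdout=handle, check=True)


if __name__ == "__main__":
    # File names are placeholders.
    run_pipeline("A.fasta", "B.fasta", "B.aln.fasta")
```

Keeping the clustering step before the alignment means the MSA is computed once, on the final set of sequences, and never edited afterwards.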

written 4.0 years ago by Michael Dondrup (43k)
4.0 years ago, Arnaud Ceol (820), Milan, Italy, wrote:

Are you working with nucleotide or protein sequences? In the latter case, the work has already been done by UniProt/UniRef: http://www.uniprot.org/help/uniref

written 4.0 years ago by Arnaud Ceol (820)