Question

Reduce redundancy of alignment based on taxonomy and identity


8.3 years ago

fhsantanna ▴ 610

I would like to reduce the redundancy of my alignment based on taxonomy and identity.

I know that CD-HIT can do this based on identity, but do you know a way to also include taxonomy data? My concern is to preserve very similar sequences that could have been transferred between different taxa. If you know of a script that can do this, I would appreciate it.

Here is my idea, though I suspect it is not very clever. There should be a simpler way...

First, separate the sequences into files based on taxonomy, say by genus.

Then, for each file, run CD-HIT with an identity threshold.

Finally, concatenate the CD-HIT output files.
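The three steps above could be sketched roughly as follows in Python. This is only a minimal sketch under some assumptions: the headers are NCBI-style and end with an organism tag such as `[Escherichia coli]`, the genus is naively taken as the first word of that tag (so names like "Candidatus ..." or "uncultured ..." would be grouped incorrectly), and the `cd-hit` executable is on the PATH. The function names (`genus_of`, `read_fasta`, `group_by_genus`, `reduce_by_genus`) are made up for illustration.

```python
import re
import subprocess
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumption: NCBI-style headers ending in "[Genus species]", e.g.
# ">WP_000000001.1 hypothetical protein [Escherichia coli]".
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def genus_of(header):
    """Return the first word of the trailing [organism] tag, or 'unknown'."""
    match = ORGANISM_RE.search(header)
    return match.group(1).split()[0] if match else "unknown"


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_genus(records):
    """Map genus -> list of (header, sequence) records."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[genus_of(header)].append((header, seq))
    return dict(groups)


def reduce_by_genus(fasta_path, out_path, identity=0.9):
    """Cluster each genus separately with CD-HIT, then concatenate the results."""
    with open(fasta_path) as handle:
        groups = group_by_genus(read_fasta(handle))
    with tempfile.TemporaryDirectory() as tmp, open(out_path, "w") as out:
        for i, (genus, records) in enumerate(sorted(groups.items())):
            genus_in = Path(tmp) / f"group_{i}.fasta"
            genus_out = Path(tmp) / f"group_{i}.cdhit.fasta"
            genus_in.write_text("".join(f">{h}\n{s}\n" for h, s in records))
            # CD-HIT options: -i input, -o output, -c identity threshold.
            subprocess.run(
                ["cd-hit", "-i", str(genus_in), "-o", str(genus_out),
                 "-c", str(identity)],
                check=True,
            )
            # The CD-HIT output FASTA holds one representative per cluster.
            out.write(genus_out.read_text())
```

Calling something like `reduce_by_genus("hits.fasta", "reduced.fasta", identity=0.9)` (file names are placeholders) would then write one set of representatives per genus into a single file. Because clustering never crosses genus boundaries, near-identical sequences from different genera are both kept.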

Any suggestions?

CD-HIT alignment redundancy • 2.3k views



I think the approach you described is reasonable: separate by taxonomy first, then combine the cluster outputs. This makes sense because you want to keep sequences that may have been transferred between species; clustering everything together would collapse them into the same cluster.

8.3 years ago by Jenez ▴ 540


"I would like to reduce the redundancy of my alignment based on taxonomy and identity."

You lost me at the first sentence. Can you clarify this, by perhaps expanding that sentence into a paragraph?

updated 4.3 years ago by Ram 43k • written 8.3 years ago by Brian Bushnell 20k


I want to do a comprehensive phylogenetic reconstruction of a particular protein family.

Suppose I have BLASTed my protein of interest against a GenBank database (RefSeq, nr). Even after filtering the results by E-value, I would obtain thousands of proteins. Many of them would have little phylogenetic value because they are too similar to one another, and they could interfere with the phylogenetic reconstruction and its interpretation. For example, imagine I have ten proteins from the genus Escherichia (or from a higher-level taxon) that share more than 90% identity with each other. To decrease the complexity of the input data for phylogenetic reconstruction, I would pare these sequences down, keeping only one "archetype ortholog" for Escherichia. This way, I would expect to reduce my input data to hundreds of proteins, not only to speed up computation but also to ease interpretation.

updated 4.3 years ago by Ram 43k • written 8.3 years ago by fhsantanna ▴ 610