Question: Feeding FASTA-ggsearch36 results for MCL clustering
Asked 4.7 years ago by Anand Rao (United States):

I have the task of clustering ~18K plant proteins, with the ultimate goal of inferring gene gain and loss, which requires inferring orthology. This has been a nightmare because the proteins are multi-domain and one of the domains is highly promiscuous. Sequences in my orthogroups align poorly, with many gaps, and their trees have several branches with little to no bootstrap support.

Therefore, I no longer care so much about finding 'orthologs'; I simply want to gather sets of 'homologous protein sequences'. The only domain common to all these proteins, the promiscuous one, is relatively short (48 aa) and poorly conserved, so I can't use domain-only alignment or phylogeny for obvious reasons.

Rather than using BLAST's local search algorithm, I've started wondering about ggsearch36 from the FASTA package by Bill Pearson, which uses a global-global search algorithm. It also has an option to produce output in BLAST format (tabular, I think). If I can reproduce the global-global FASTA search results in BLAST's -m 8 tabular format, then they should work with MCL, correct?

Other than the workflow logistics, more importantly, would this be scientifically unacceptable for any reason? Or come with any big caveats? I can think of a few, but I'll wait for your responses.

I suppose I'd have to define what a 'sequence homolog' would be for this approach? For example, could I use cutoffs of 90% sequence identity and +/-10% sequence length variation? Any thoughts? Thanks!
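As a rough illustration of the cutoffs floated above, here is a minimal Python sketch of a pairwise filter. The function name `is_homolog_pair` and its parameters are hypothetical, and reading "+/-10% length variation" as the length difference relative to the longer sequence is an assumption, not an established definition. Sequence lengths are not part of the 12-column tabular output, so they would have to be supplied separately (for example, from the FASTA file).

```python
def is_homolog_pair(pident, qlen, slen, min_identity=90.0, max_len_diff=0.10):
    """Return True if a hit meets both the identity and the length cutoff.

    pident        -- percent identity of the alignment (0-100)
    qlen, slen    -- full lengths of the query and subject sequences
    min_identity  -- assumed identity cutoff (90% as proposed in the question)
    max_len_diff  -- assumed cutoff on |qlen - slen| / max(qlen, slen)
    """
    if qlen <= 0 or slen <= 0:
        return False  # lengths must be positive to compare them
    rel_diff = abs(qlen - slen) / max(qlen, slen)
    return pident >= min_identity and rel_diff <= max_len_diff
```

With cutoffs this strict, only near-identical, near-equal-length pairs survive, so the thresholds would likely need tuning against the data before clustering.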


MCL operates on the adjacency matrix of a (preferably undirected) graph, so even if your output is not identical to BLAST's -m 8, you can always post-process your data to get a matrix of similarities between your sequences in one of the formats that mcl accepts as input.
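The post-processing step could look something like the following Python sketch, which turns 12-column tabular hits into the three-column label format ("abc": node, node, weight) that mcl reads. The function names `tabular_to_abc` and `write_abc` are hypothetical, and several choices here are assumptions rather than requirements: the bit score (column 12) is used as the edge weight, reciprocal hits are merged by keeping the higher score so the graph is undirected, and self hits are skipped.

```python
import io


def tabular_to_abc(lines, score_col=11):
    """Convert BLAST -m 8 style rows into undirected (a, b, weight) edges.

    score_col is the 0-based column used as the weight; 11 is the bit
    score in the standard 12-column layout (an assumed choice).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular row
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # drop self hits
        score = float(fields[score_col])
        key = tuple(sorted((query, subject)))  # same key for A-B and B-A
        if score > best.get(key, float("-inf")):
            best[key] = score  # keep the better of the reciprocal hits
    return [(a, b, w) for (a, b), w in sorted(best.items())]


def write_abc(edges, handle):
    """Write edges as tab-separated 'a b weight' lines."""
    for a, b, w in edges:
        handle.write(f"{a}\t{b}\t{w}\n")
```

A file written this way can then be passed to mcl with its `--abc` option, for example `mcl edges.abc --abc -I 2.0 -o clusters.txt`, where the file names and the inflation value are placeholders.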

To infer gains/losses, you would still need a tree. You could use the clusters as starting points, as in the TreeFam strategy (see the first paper and the update).

Written 4.7 years ago by Jean-Karim Heriche; modified 9 months ago by RamRS.