Question

Sequence identity percentage (again)

0

8.4 years ago

prostoesh ▴ 20

What is the easiest way to take 2 genomes (bacterial proteins) and compare all genes from one vs all genes from another one?

I know this question has been asked here a million times, but I still don't get why most of the advice is about aligning sequences first. I don't need to align genes; I just want a quick statistical summary of how many genes are 100% similar between the two genomes, how many are 90% similar, and so on for all genes.

So far I've tried proteinortho, mafft, oma, cd-hit, get_homologues and some others, but still no luck.

gene sequence genome blast • 2.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 8.4 years ago by prostoesh ▴ 20

2

What alternative metric do you want to propose for this comparison, especially since you don't want to do alignments (which would be needed to get % sequence similarity)?

ADD REPLY • link 8.4 years ago by GenoMax 141k

0

OK, I actually wasn't sure about that, thank you for clarifying! I am quite new to this, so I'm learning as I go.

So my best option is to align all against all and pick a metric (like the number of BLAST identities divided by the length of the alignment)?
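For illustration, the metric mentioned above (identities divided by alignment length) and the binning into a distribution could be sketched roughly as follows. The function names, the bin width, and the example numbers are made up for this sketch and do not come from any particular tool.

```python
from collections import Counter


def percent_identity(identities: int, alignment_length: int) -> float:
    """Identical positions divided by alignment length, as a percentage."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return 100.0 * identities / alignment_length


def identity_distribution(values, bin_width=10):
    """Count values per bin; exact 100% matches get their own bin."""
    counts = Counter()
    for value in values:
        if value >= 100.0:
            counts["100"] += 1
        else:
            low = int(value // bin_width) * bin_width
            counts[f"{low}-{low + bin_width}"] += 1
    return dict(counts)


# Hypothetical (identities, alignment_length) pairs, one per gene's best hit.
hits = [(300, 300), (270, 300), (285, 300), (150, 300)]
values = [percent_identity(i, n) for i, n in hits]
distribution = identity_distribution(values)
```

Note that this metric says nothing about how much of each gene the alignment covers, so a short, perfect local match would still count as 100%.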

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by prostoesh ▴ 20

1

What exactly are you trying to achieve?

Doing an all vs all comparison is feasible, but then you would have to be judicious about where to set the similarity cut-offs and which blast/blat parameters to use. Besides all the common domains, you will find orthologs/paralogs that give significant hits.

If you are more interested in finding out how these two genomes relate to each other (i.e. not at the gene level but at the genome level), then you may want to look at the Mauve genome alignment tool. It is designed for this type of analysis.

ADD REPLY • link 8.4 years ago by GenoMax 141k

0

I am trying to get a numerical distribution, or a chart like this

(I don't need an actual graph, only the distribution)

I'll look into Mauve, thanks

ADD REPLY • link 8.4 years ago by prostoesh ▴ 20

1

So the idea would be to do an all vs all comparison maximizing the Query/Hit coverage (and perhaps only choosing the top hit, so you don't get bogged down with smaller domains etc).

An absolute answer would need lots of careful analysis but if you only want a gross overview this may work.
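As a rough sketch of the "top hit with good query coverage" idea: the code below assumes BLAST tabular output in the default 12-column layout (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The `query_lengths` dictionary, the 0.8 coverage cut-off, and the example rows are assumptions made for illustration; the sensible cut-off depends on your data.

```python
def best_hits(blast_lines, query_lengths, min_coverage=0.8):
    """Keep the highest-bitscore hit per query among hits with enough query coverage."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        bitscore = float(fields[11])
        # Fraction of the query covered by this alignment.
        coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if coverage < min_coverage:
            continue
        if qseqid not in best or bitscore > best[qseqid][2]:
            best[qseqid] = (sseqid, pident, bitscore)
    return best


# Hypothetical rows: geneA has a full-length hit and a short domain hit,
# geneC only has a short hit that fails the coverage filter.
rows = [
    "geneA\tgeneX\t100.00\t300\t0\t0\t1\t300\t1\t300\t1e-150\t600",
    "geneA\tgeneY\t45.00\t80\t44\t0\t10\t89\t5\t84\t1e-10\t60",
    "geneB\tgeneZ\t90.00\t200\t20\t0\t1\t200\t1\t200\t1e-90\t350",
    "geneC\tgeneX\t60.00\t50\t20\t0\t1\t50\t1\t50\t1e-5\t40",
]
query_lengths = {"geneA": 300, "geneB": 200, "geneC": 400}
top = best_hits(rows, query_lengths)
```

Queries that end up with no entry in `top` (here `geneC`) would be the genes without a full-length counterpart in the other genome.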

The OrthoMCL solution noted below would be another option, but it may complicate things, since multiple sequences may be lumped together in a cluster.

ADD REPLY • link 8.4 years ago by GenoMax 141k

0

OK, thanks for your answers. I will definitely try all the suggestions!

ADD REPLY • link 8.4 years ago by prostoesh ▴ 20

Ram · Answer 1 · 2015-11-24

1

8.4 years ago

Juke34 8.5k

From the sequence clusters, you could write a script that calculates the identity of the sequences within each cluster and summarizes the overall results.
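One possible building block for such a script, assuming the sequences in a cluster have already been aligned pairwise (gapped strings of equal length), is sketched below. The function name, the gap character, and the example sequences are illustrative only, and other definitions of identity (e.g. dividing by the shorter sequence length) are equally common.

```python
def aligned_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns, ignoring columns that are gaps in both."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(seq_a, seq_b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


# Hypothetical aligned pair: 5 matching columns out of 7.
example = aligned_identity("MKT-LLA", "MKTALIA")
identical = aligned_identity("MKTALLA", "MKTALLA")
```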

ADD COMMENT • link 8.4 years ago by Juke34 8.5k

0

hm, that sounds like a great idea, thanks!

Although I'm a little confused about clusters, because I'm not sure how programs like proteinortho do their job. Do they use all genes, or just the ones that can be clustered?

Also, proteinortho, for example, can make clusters consisting of 5-6 genes (2-3 from each genome), and I don't know how to separate them.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by prostoesh ▴ 20

1

Clusters should contain, in theory, the sequences that can be clustered. I don't know about the tools you use, but with OrthoMCL (which works only on proteins) you have, at the end of the cluster file, "clusters" that contain only one sequence. These correspond to the species-specific genes (I will call them orphans from now on). If the tools don't give you the "orphan" genes (within the cluster output file or in a separate file), you can deduce them.

For the clusters with more than 2 sequences, yes, it will be difficult. I don't know what the best approach is. My suggestion would be the following:

For each cluster, you could calculate the identity of every sequence from species 1 against every sequence from species 2, and keep only the best results, pairing the sequences two by two. If the number of sequences is odd, you can treat the leftover one as an "orphan".
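A minimal sketch of this pairing, assuming the pairwise identities within a cluster have already been computed: it greedily takes the highest-identity pair first, then the next best pair among the sequences still unpaired, and reports whatever is left as orphans. The function name, the input layout, and the example values are hypothetical.

```python
def pair_within_cluster(identities):
    """identities maps (species1_seq, species2_seq) -> percent identity.

    Returns the greedily chosen best pairs and the unpaired sequences.
    """
    pairs = []
    used1, used2 = set(), set()
    ranked = sorted(identities.items(), key=lambda item: item[1], reverse=True)
    for (seq1, seq2), identity in ranked:
        if seq1 in used1 or seq2 in used2:
            continue  # one of the two sequences is already paired
        pairs.append((seq1, seq2, identity))
        used1.add(seq1)
        used2.add(seq2)
    all1 = {seq1 for seq1, _ in identities}
    all2 = {seq2 for _, seq2 in identities}
    orphans = sorted((all1 - used1) | (all2 - used2))
    return pairs, orphans


# Hypothetical cluster: three genes from species 1, two from species 2.
cluster = {
    ("a1", "b1"): 98.0, ("a1", "b2"): 70.0,
    ("a2", "b1"): 65.0, ("a2", "b2"): 91.0,
    ("a3", "b1"): 60.0, ("a3", "b2"): 55.0,
}
pairs, orphans = pair_within_cluster(cluster)
```

Greedy pairing is simple but not guaranteed to give the best overall assignment; for a gross overview it is probably good enough.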

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Juke34 8.5k

0

yeah, that sounds quite thorough, I will try this

Much obliged for the help!

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by prostoesh ▴ 20