Question

How To Know Percentage Of Identity Between Every Pair Of Orthologous Genes In Several Genomes?

2

Entering edit mode

10.9 years ago

sentausa ▴ 650

Hi all,

I need to find out the percentage of identity between every pair of orthologous genes in 4 different (but closely related) bacteria.

The dataset I have is the nucleotide sequences of each gene (I mean, ORFs) in the genomes and the information on which gene is orthologous to which (based on OrthoMCL result). There are ~1500 orthologous groups, so at the end I hope to have ~1500 tables which show percentage of identity among the genes in each group. Well, even better is to have ~1500 identity percentage ranges since these are what I'm really after.

Is there a software to do this? (Sorry but I haven't searched for it myself since I don't even know what to search for.)

If such software doesn't exist, I'm thinking to build one myself since I'm learning Python. Any suggestion for that? I'm thinking to use global alignment algorithm like Needleman–Wunsch's, and preferably using Windows (since this is the only available option for me; but please don't hesitate to answer if you have a Linux solution).

Thank you.

(Edited to explain OS choice)

python alignment • 14k views

ADD COMMENT • link updated 10.8 years ago by Eric Normandeau 11k • written 10.9 years ago by sentausa ▴ 650

0

Entering edit mode

It is hard for me to imagine that Windows is the only available option since Linux is free. Since you seem to want to do some more bioinformatics work in the future, I will restate what has already been said elsewhere: Learn Linux, Python, and Bash thoroughly. This is simply essential. It is a long process, but everything you learn along the way is directly usable and useful. START NOW! ;)

ADD REPLY • link 10.8 years ago by Eric Normandeau 11k

score 3 · Answer 1 · 2013-06-12

3

Entering edit mode

10.9 years ago

Arnaud Ceol ▴ 860

The EMBOSS ( "The European Molecular Biology Open Software Suite" ) includes many tools to align sequences (local or global) and compute similarity, identity, distances etc. It includes an implementation of the Needleman–Wunsch algorithm. I'm using it indeed to calculate the sequence identity between orthologs predicted by other tools like InParanoid, OMA, etc. If you want to program in python, the BioPython library provides a wrapper around EMBOSS programs: http://biopython.org/DIST/docs/tutorial/Tutorial.html. Although it is easier to install everything in Linux systems, it also works well in Windows.

ADD COMMENT • link 10.9 years ago by Arnaud Ceol ▴ 860

0

Entering edit mode

here is the link to the EMBOSS suite: http://emboss.sourceforge.net/

ADD REPLY • link 10.9 years ago by Arnaud Ceol ▴ 860

0

Entering edit mode

Thanks. I'll look into it.

ADD REPLY • link 10.9 years ago by sentausa ▴ 650

0

Entering edit mode

Some lines of example using Needleman-Wunsh, provided that you have to files seq1.fasta and seq2.fasta to compare:

cline = NeedleCommandline(r"./emboss/needle", asequence="seq1.fasta", bsequence="seq2.fasta", gapopen=10, gapextend=1, outfile="needle.txt")
return_code = subprocess.call(str(cline), shell=(sys.platform != "win32"))

needle = open("needle.txt")

for line in needle:
    if line.find("Identity") > 0:
       nidentity = line[line.find(":") + 1:line.find("/")].strip()
       nidentitypercent = line[line.find("(") + 1:line.find("%")]
    if line.find("Similarity") > 0:
       nsimilarity = line[line.find(":") + 1:line.find("/")].strip()
       nsimilaritypercent = line[line.find("(") + 1:line.find("%")]
needle.close()

ADD REPLY • link 10.7 years ago by Arnaud Ceol ▴ 860

score 2 · Answer 2 · 2013-06-11

2

Entering edit mode

10.9 years ago

Damian Kao 16k

Sounds like a great beginner's project.

Some points to consider:

OrthoMCL's orthologous groups will contain up to 4 ids each for each bacteria (if you exclude paralogs). So you won't strictly have one percent identity value for each group.
Maybe a better way will be to parse the ortholog group file and generate 6 pair-wise ortholog files accordingly. Then generate percent identity ranges for each of the 6 pair-wise orthologs.
I think the all vs all blast that you performed for OrthoMCL already contains percent identity score. You could just use the orthologous group clusters and parse through the blast results for percent identity.
Think about what you are trying to accomplish with this data. How would you interpret the range of percent identities you get?

ADD COMMENT • link 10.9 years ago by Damian Kao 16k

0

Entering edit mode

Thank you for the answer. I don't quite understand your first and second points. Actually I have "parsed" (manually) the OrthoMCL result so that there is only at maximum one protein id for one bacteria in an orthologous group. And what do you mean with parsing the ortholog group file and generating 6 pair-wise ortholog files? The all vs all blast might already contain percent identity score, but it is for the protein, right? What I need is the percentage identity for the nucleotide sequence, so what do you mean by using the orthologous group clusters and parsing through the blast results for percent identity?

ADD REPLY • link 10.9 years ago by sentausa ▴ 650

0

Entering edit mode

You performed orthomcl on the 4 species of bacteria so each orthologous group could have a gene from each of the 4 species. In those cases, you won't be able to obtain a single percent identity for that group since there are 6 possible pair-wise comparisons.

ADD REPLY • link 10.9 years ago by Damian Kao 16k

0

Entering edit mode

I understand that there are 6 possible pair-wise comparisons; that's why I expect to have percent identity range and not a single value. But could you please explain me more your other points? I'd really appreciate it. Thanks.

ADD REPLY • link 10.9 years ago by sentausa ▴ 650

score 1 · Answer 3 · 2013-06-11

1

Entering edit mode

10.9 years ago

Biojl ★ 1.7k

Why would you want to do the pairwise alignments instead of doing a single multiple sequence alignment for each ortholog gene?

In any case you can use one of several alignment programmes available. I would recommend MAFFT since is quite fast, reliable and easy to use. You can also try t-coffee or muscle.

I would call the aligning programme from python and use it as an opportunity to learn a bit the language. Once the alignments are created is quite easy to create a function that calculates the percentage of identity for each alignment. You can also take a look at the BioPython module, since both aligners and parsers for alignments outputs are done.

It will be a lot easier for you if you start using linux because almost everything in bioinformatics is command line / linux oriented. You can start using Ubuntu which is quite similar to windows in some aspects.

ADD COMMENT • link 10.9 years ago by Biojl ★ 1.7k

0

Entering edit mode

Thank you for answering. I'd like to do pairwise alignments instead of a single multiple alignment for each ortholog gene because I want to know the percent identity between each orthologous gene pair. Can I get that by doing multiple alignment? How?

Well, I should have mentioned in my question (I just edited it to do exactly that) that the reason of my OS choice was that Windows is the only available one for me right now. But I agree with you on the usefulness of linux in bioinformatics.

ADD REPLY • link 10.9 years ago by sentausa ▴ 650

0

Entering edit mode

You can extract the pairwise identity for each pair of orthologs from a multiple alignment of the 4 using a simple script. But it may be more accurate to do the alignments pairwise directly. You can still follow my recommendations, but instead of providing all 4 orthologs per gene you can generate all the pairwise alignments with MAFFT or others by providing just two sequences at a time.

ADD REPLY • link 10.9 years ago by Biojl ★ 1.7k

score 0 · Answer 4 · 2013-06-11

0

Entering edit mode

10.9 years ago

suzanne.mcgaugh ▴ 60

Hi, It may be easier for you to create multiple alignments using MAAFT, Muscle or (if protein) MSAprobs and then make a distance matrix. If you have only a PC, you can make the distance matrices in BioEdit or DNASP.

This is going to be brutal if you can't automate it somehow. Perhaps look into GenAlEx which is an excel plugin- it may have this capability. I know GenAlEx can calculate Nei's Genetic Distance. Then may be you can write an excel macro to crank through it.

If you do delve into Unix (which I can't encourage you strongly enough to do), try http://molpopgen.org/software/lseqsoftware.html the Kimura80 and KaKsguesstimator programs will do what you need.

ADD COMMENT • link 10.9 years ago by suzanne.mcgaugh ▴ 60

0

Entering edit mode

Hi and thanks for your answer. Actually this is what one of my colleagues suggested me to do (calculating distance matrix using MEGA, for example), but I wonder if we can consider distance (as in distance matrix) to be the same as percentage identity. Don't genetic distances like Nei's and Kimura's imply more things (evolutionary speaking) than just nucleotide percentage identity?

ADD REPLY • link 10.9 years ago by sentausa ▴ 650

0

Entering edit mode

In bioedit, under 'Alignment' there is a 'Sequence Identity Matrix."

Otherwise, this might be a place to start with python: http://stackoverflow.com/questions/16266622/in-python-calculate-percent-identity-between-two-strings

ADD REPLY • link 10.9 years ago by suzanne.mcgaugh ▴ 60