How to find three consecutive orthologous genes in 800 bacterial genomes?
4.5 years ago
natasha.sernova ★ 4.0k

Dear all, I am afraid this particular question has been asked several times, but I could not find any of the previous posts. I have 800 bacterial genomes. I know that some of these genomes may have the group of three consecutive genes with any probable insertions of some foreign genes. Sometimes one or two genes from such a group are lost – it does not matter, I need the rest left. What is the easiest way to find orthologs of these genes in all 800 bacterial genomes? I am not sure that three simple alignments of a single gene sequence against all 800 genomes will help. (I read such a discussion some time ago, but I cannot find it now.) I am also not sure I know of good software for this. I hope there is a better way that I have forgotten about. Thank you very much! Sincerely, Natasha

genome bacteria software alignment • 1.3k views

This will not be simple, since you say that

"some of these genomes may have the group of three consecutive genes with any probable insertions of some foreign genes."

I would suggest using the three genes independently to locate their homologs in the 800 genomes, and then reconciling the results to see whether the hits lie within a certain distance of each other and/or in the order you expect.

Ortholog-finder may also come in handy.
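
The reconciliation step suggested above could look roughly like the following. This is only a minimal sketch: it assumes the three searches have already been parsed into records with a genome, a contig, the query name and hit coordinates. The `Hit` fields, the function names and the 5000 bp default are illustrative assumptions, not something taken from the thread or from any particular tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    genome: str   # genome identifier (hypothetical field)
    contig: str   # replicon or contig the hit lies on
    gene: str     # which query produced the hit, e.g. "g1"
    start: int    # hit coordinates in bp, with start <= end
    end: int


def cluster_hits(hits, max_gap=5000):
    """Group hits on the same contig that lie within max_gap bp of each other."""
    by_contig = defaultdict(list)
    for hit in hits:
        by_contig[(hit.genome, hit.contig)].append(hit)

    clusters = []
    for key in sorted(by_contig):
        ordered = sorted(by_contig[key], key=lambda h: h.start)
        current = [ordered[0]]
        for hit in ordered[1:]:
            # The gap is measured from the furthest end seen so far in the cluster.
            if hit.start - max(h.end for h in current) <= max_gap:
                current.append(hit)
            else:
                clusters.append(current)
                current = [hit]
        clusters.append(current)
    return clusters


def gene_order(cluster):
    """Return the query names in the order they occur along the contig."""
    return [hit.gene for hit in cluster]
```

Clusters whose `gene_order` contains two or three of the queries are the candidates worth inspecting. The sketch treats every contig as linear, so a group that spans the origin of a circular chromosome would be split in two.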


Before I really chip in here, can I ask for clarification of the following:

"Sometimes one or two genes from such a group are lost – it does not matter, I need the rest left."

Do I understand correctly that, from your group of three, up to two can be lost (so only one of the three remains)? How would you then detect that the remaining one was once part of that group?

I'm asking because I might have an approach, but it has a lower limit of three (e.g. in a group of four, one can be lost); with fewer it becomes less feasible or even impossible.


Unfortunately, it is possible. A situation like gene1-insertion-gene3 is common, as is a single gene remaining out of the three, for example gene1-insertion1-insertion2, insertion1-gene2-insertion2 or insertion1-insertion2-gene3. I will have to check these three genes separately, as @genomax suggested. Oh, and measure the distance between the genes in cases like gene1-insertion-gene3. But how can this be made easy? Will ortholog-finder help with this task?


I understand, but what I mean is that if only gene1 is left, you cannot determine whether it was once part of the group or has always been a single gene, with the other two never there. Not without additional evidence, that is. Are the inserted genes 'conserved'?

Is the order important, by the way? Is it always g1 g2 g3, or can it also be g2 g1 g3, ...?


The order is strongly conserved. It depends only on the strand: it is either g1 g2 g3 or g3 g2 g1. Actually, I was wrong: I don't have insertions. Instead, any of the three genes may simply be replaced by some 'hypothetical' gene that is not orthologous to the gene it replaces.
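
With the order fixed as g1 g2 g3 (or g3 g2 g1 on the other strand) and replacements rather than insertions, the check can be reduced to sliding a window of three genes along each contig. The sketch below assumes that each contig has already been turned into a list of labels in genomic order, where a gene matching one of the queries carries that query's name and every other gene carries some other label. The labels, function names and the `min_matches` threshold are assumptions made for illustration.

```python
PATTERN = ("g1", "g2", "g3")


def match_window(window, pattern, targets):
    """Count slots that hold the expected gene.

    Returns -1 if a query gene sits in the wrong slot, because that
    contradicts the conserved order.
    """
    matches = 0
    for seen, expected in zip(window, pattern):
        if seen == expected:
            matches += 1
        elif seen in targets:
            return -1
    return matches


def find_groups(genes, pattern=PATTERN, min_matches=2):
    """Return (index, orientation, window) for every window that fits the pattern."""
    targets = set(pattern)
    orientations = {"+": tuple(pattern), "-": tuple(reversed(pattern))}
    found = []
    for i in range(len(genes) - len(pattern) + 1):
        window = genes[i:i + len(pattern)]
        for orientation, expected in orientations.items():
            if match_window(window, expected, targets) >= min_matches:
                found.append((i, orientation, list(window)))
    return found
```

For example, `find_groups(["a", "g1", "hyp", "g3", "b"])` reports one window at index 1 in the "+" orientation, with the middle gene replaced. Lowering `min_matches` to 1 would also report lone survivors, but then the orientation is ambiguous whenever only g2 remains, and, as discussed above, a lone gene cannot be shown to have belonged to the group.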


Dear all, many thanks for your answers, all of them are really helpful!

4.5 years ago
Mensur Dlakic ★ 27k

I can think of two options, and will list them by increasing time investment.

1) Submit your proteins, one at a time, to STRING. This will automatically create gene neighborhood plots for all the genomes it has, which should give you a pretty good idea of how frequently these proteins are found next to each other. I know that is not the same as interrogating your 800 genomes, but it is likely that most of them are already included in STRING.

2) A variation of what was already suggested: get the proteomes for all your species, concatenate them into a single file, and search your proteins of interest individually against this database. Post-processing the three outputs would involve extracting the GI numbers of the matches and counting how many times three consecutive GI numbers appear when the three outputs are combined. If you want to allow for insertions, you can stipulate that the difference between the smallest and largest GI number may be up to 4 instead of 2, which would allow for 2 inserted genes. By the way, this can be done at the DNA level as well, by concatenating .ffn instead of .faa files.
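
The post-processing in option 2 might be sketched as follows. The code assumes each match has been reduced to an integer that increases by one from gene to gene along the genome. That integer stands in for the GI number here, but any per-gene ordinal would do, provided the numbers never run consecutively across genome boundaries. The function name and parameters are hypothetical.

```python
def find_neighbourhoods(matches, max_span=2, min_genes=3):
    """Find groups of matches whose indices fall within max_span of each other.

    matches: dict mapping a query name to the integer indices of its hits.
    max_span=2 means three strictly consecutive genes; 4 allows two extra genes.
    """
    tagged = sorted((idx, gene) for gene, ids in matches.items() for idx in ids)
    results = []
    for i in range(len(tagged)):
        first = tagged[i][0]
        window = []
        for j in range(i, len(tagged)):
            idx, gene = tagged[j]
            if idx - first > max_span:
                break
            window.append((idx, gene))
        # Keep the window only if enough distinct queries are represented.
        if len({gene for _, gene in window}) >= min_genes:
            results.append(window)
    return results
```

Calling it with `max_span=4` corresponds to the relaxed rule that tolerates two inserted genes, and `min_genes=2` would also keep groups in which one of the three genes is missing. Overlapping windows can be reported more than once when paralogous hits are present, so some de-duplication may be needed on real data.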


One problem is that I only have *.gb files. The gbk format disappeared in 2013-14. *.gb files are human-readable text, but I don't know how to convert them to any other format. The format is 'almost' the previous gbk format, but only 'almost'...


Try seqret from EMBOSS to convert the files.


Many thanks! *.gb implies GenBank, I think.

http://emboss.sourceforge.net/docs/faq.html

Q) What sequence formats are supported?

A) Many:

gcg, embl, swissprot, fasta, ncbi, genbank, nbrf, codata, strider, clustal, phylip, acedb, msf, ig, staden, text, raw, asis


sreformat from the old HMMER package (v2.3-ish) can convert GenBank files to FASTA and other formats.


Thank you very much!

But the newer HMMER packages cannot?


It seems that the equivalent program in current HMMER (esl-reformat) can't do GenBank conversion. HMMER keeps an archive of old versions, and I think that v2.3.2 will work. sreformat is an auxiliary program that can be found in the squid directory after compilation.
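
If none of these converters is at hand, the nucleotide part of a GenBank-to-FASTA conversion is simple enough to sketch in plain Python. The function below is an illustration only, not a replacement for seqret or sreformat: it assumes standard flat-file layout (a LOCUS line, an ORIGIN section, and `//` closing each record), ignores the annotation entirely, and uses a function name invented for this example.

```python
def genbank_to_fasta(genbank_text):
    """Convert GenBank flat-file text into nucleotide FASTA text."""
    records = []
    name, chunks, in_origin = None, [], False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            # The second field of the LOCUS line is the record name.
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: store it and reset the state for the next one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "        1 atgaccatga ttacgccaag":
            # drop the leading position number and join the 10-base blocks.
            chunks.append("".join(line.split()[1:]))
    return "".join(f">{rec_name}\n{seq}\n" for rec_name, seq in records)
```

Typical use would be to read each *.gb file as text and write the returned string to a .fasta file. Records without an ORIGIN section come out with an empty sequence. For the protein searches discussed above, the CDS translations would be needed instead, and a maintained parser such as Biopython's `Bio.SeqIO` is likely to be a more robust choice for that.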
