How To Extract The Core Genes From The OrthoMCL Output File?
10.2 years ago
Lisa ▴ 330

Hi. I was wondering if anybody can help me figure out how to use OrthoMCL to identify the core genome of a set of E. coli genomes. I have 52 E. coli genomes that I ran through OrthoMCL to produce ortholog groups. I followed all the steps in the user guide through to the end. Now I'm left with a massive file of ortholog groups, and I'm unsure how to proceed.

This is a snippet from the middle of my output file, because the head command prints too much: the first line is my biggest ortholog group. The part before the colon is the ortholog group ID; the entries after it are the genome|gene pairs that were clustered into that group.

ecoli6370: col125|YP_006311412.1 col139|YP_007556103.1 col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1 col53|YP_002998320.1 col55|YP_003043686.1 col56|YP_003053130.1 col7|YP_488800.1 col73|YP_003498239.1 col92|YP_006127895.1
ecoli6371: col125|YP_006312035.1 col127|YP_006770839.1 col131|YP_006779890.1 col134|YP_006785029.1 col3|NP_286985.1 col31|YP_002271784.1 col4|NP_309246.1 col45|YP_002397150.1 col57|YP_003079099.1 col59|YP_003222735.1 col64|YP_003233659.1
ecoli6372: col125|YP_006312040.1 col127|YP_006770834.1 col131|YP_006779885.1 col134|YP_006785024.1 col3|NP_286990.1 col31|YP_002271776.1 col4|NP_309251.1 col45|YP_002397155.1 col57|YP_003079092.1 col59|YP_003222730.1 col64|YP_003233664.1

I tried converting this file to a binary matrix, following the instructions from here (http://smokeandumami.com/2010/01/21/gene-accumulation-curves-in-r/), but I'm still stuck with how to proceed.

Thanks, I appreciate any help you can give me. Please let me know if I should provide any more information.

Lisa

Sorry for the delay, here's an example of what my binary matrix looks like. I just took a few lines as it's so large.

"ecoli1000" "ecoli1001" "ecoli1002" "ecoli1003" "ecoli1004" "ecoli1005" 
"col0"   1   1   0   0   1   0
"col1"   0   1   0   0   0   1
"col2"   0   0   1   0   1   1
"col3"   0   1   0   0   0   0
"col4"   1   0   0   1   1   1
"col5"   1   0   0   1   0   0

Could you show us the binary matrix? I believe it'll be easier to explain it from that.

10.1 years ago
sentausa ▴ 650

Anyway, I'll try to explain it without the binary matrix.

Since you are interested in finding the core genes, all you have to do is find the ortholog groups in the OrthoMCL results that contain all 52 strains. If a strain has no gene/protein in an ortholog group, that gene/protein is absent from the strain. It is therefore not part of the core genome, since a species' core genome is defined as the set of genes present in all strains of the species.
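The check described above can be sketched in a few lines of Python. This is only a minimal sketch, not part of OrthoMCL: the function names `parse_groups` and `core_groups` are made up for illustration, and it assumes every line looks like the snippet in the question (group ID, a colon, then space-separated `genome|protein` entries).

```python
from collections import defaultdict


def parse_groups(lines):
    """Map each ortholog group ID to {genome label: [protein IDs]}.

    Assumes lines of the form 'groupID: genome|protein genome|protein ...'.
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        group_id, _, members = line.partition(":")
        by_genome = defaultdict(list)
        for member in members.split():
            genome, _, protein = member.partition("|")
            by_genome[genome].append(protein)
        groups[group_id.strip()] = dict(by_genome)
    return groups


def core_groups(groups, all_genomes):
    """Return the IDs of groups that contain at least one protein from every genome."""
    required = set(all_genomes)
    return [gid for gid, by_genome in groups.items() if required <= set(by_genome)]


# Toy input with three hypothetical genomes (g1, g2, g3).
example = [
    "grpA: g1|p1 g2|p2 g3|p3",
    "grpB: g1|p4 g3|p5",
    "grpC: g1|p6 g1|p7 g2|p8 g3|p9",
]
groups = parse_groups(example)
print(core_groups(groups, ["g1", "g2", "g3"]))
```

For the real data you would pass an open file handle to `parse_groups` and the list of all 52 genome labels to `core_groups`. Note that this only asks whether each genome is present at least once, so a group with several proteins from one genome (like `grpC` in the toy input) still counts as core.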

So, in the binary matrix shown on the blog, you'd be interested only in the columns that have no 0 in them.
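If you prefer to work from the matrix, the same idea is a simple column scan. Here is a rough sketch, assuming the matrix is whitespace-delimited text laid out like the example in the question (a header row of quoted group names, then one row per genome starting with the quoted genome name). `core_columns` is a hypothetical helper name, not something from the blog post or from OrthoMCL.

```python
def core_columns(matrix_text):
    """Return the names of columns (ortholog groups) that contain no 0.

    Assumes a header row of column names followed by rows of the form
    '"rowname" v1 v2 ...', where each value is 0 or 1.
    """
    lines = [ln for ln in matrix_text.splitlines() if ln.strip()]
    header = [name.strip('"') for name in lines[0].split()]
    all_present = [True] * len(header)
    for line in lines[1:]:
        fields = line.split()
        values = fields[1:]  # fields[0] is the genome (row) name
        if len(values) != len(header):
            raise ValueError(f"row {fields[0]} has {len(values)} values, expected {len(header)}")
        for i, value in enumerate(values):
            if value == "0":
                all_present[i] = False  # this group is missing from at least one genome
    return [name for name, keep in zip(header, all_present) if keep]


# Toy matrix: three hypothetical genomes (rows) by three groups (columns).
toy = '''
"grpA" "grpB" "grpC"
"g1" 1 1 1
"g2" 1 0 1
"g3" 1 1 1
'''
print(core_columns(toy))
```

The returned names are the ortholog groups present in every genome in the matrix, so the result is only meaningful when the matrix contains all of your genomes as rows.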


Thanks, that makes a bit more sense. It seems really simple when you put it like that, so I think I was just having a temporary brain melt or something.

4.8 years ago

Use this script, parseOrthoMCLOutput.py. It will generate FASTA files for all core, accessory and unique genes.

9.6 years ago
amanjain • 0

I have a very simple way to find core gene clusters in Excel. Let me know if anyone needs help.

If anyone needs help with Venn diagrams, try http://bioinformatics.psb.ugent.be/webtools/Venn/; it will do the work in seconds.


Hi, I need help with this very simple way to find core gene clusters in Excel. Could you explain how?
