Question

How To Find The Pan Genome Of 30 Bacterial Strains

2

12.7 years ago

Naren ★ 1.0k

I have identified the core ortholog set (core genome) of 30 bacterial strains using the NCBI BLAST package. But finding the pan genome (unique genes + accessory genes + core genes) of the same 30 organisms is proving difficult, since it is not practical to align each genome against every other one, roughly 900 alignments in total. I can't work out any other logical approach that would let me determine the accessory genes without counting repeats. Please help. Thanks in advance.

genome bacteria • 6.6k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 12.7 years ago by Naren ★ 1.0k

1

12.7 years ago

Asaf 10k

Maybe you can somehow use the precomputed CLUSTERS database of NCBI (http://www.ncbi.nlm.nih.gov/proteinclusters). It contains clusters of orthologous proteins.

ADD COMMENT • link 12.7 years ago by Asaf 10k

1

9.8 years ago

Naren ★ 1.0k

I have been working on this ever since I asked the question, and after all that work I built a tool myself, which I named BPGA (Bacterial Pan Genome Analysis pipeline).

Along with core, accessory and unique genes, it also offers many other features, such as functional and pathway assignments, statistics and more.

It is available on my SourceForge page.

ADD COMMENT • link 9.8 years ago by Naren ★ 1.0k

1

First, I would like to commend you on pursuing this over the years; that is very admirable.

Now, as scientific software goes, there are quite a few more essential steps.

Where is the source code? Where is the documentation? Where is the description of what the tool actually does? Where are the example inputs and outputs? All of that is required for proper scientific software.

I and many others would object to running an executable, especially a Windows-based one.

Put your code on GitHub instead of the awful SourceForge, open the sources and show what you can do. Most good companies hire off of GitHub directly; I have myself received many offers on the strength of my GitHub account alone.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.8 years ago by Istvan Albert 102k

0

Thanks for the appreciation. I am planning to compile Linux and Mac executables too. Of course, your suggestion about GitHub is better.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.8 years ago by Naren ★ 1.0k

1

I was referring to the source code: let people compile your code themselves, so that there is less danger of running a compromised binary.

I would strongly urge everyone to NEVER download and run binaries.

I have found that one way to spot a novice programmer is whether they are willing to show their source code. Those who are starting out often seem to think their code is somehow precious and that everyone is out to steal it and sell it as their own. Nothing could be further from the truth.

Let people use and understand what the software does. Put whatever license you want on it; if you want to retain commercial rights, so be it. Do everything you can to demonstrate that it is worth other people's time.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.8 years ago by Istvan Albert 102k

score 5 · Accepted Answer · 2012-11-12

You probably do not need to align every gene against every other (if that's what you meant above). A simpler technique would be to keep adding genes to a database only if they are sufficiently different from the genes that are already there. As you perform the alignments, you will need a simple program to tabulate the hits: genes that were hit just once over the entire process are the unique genes, genes that were hit for every strain are the core genes, and the rest are the accessory genes.
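To make the idea concrete, here is a rough Python sketch of that greedy approach. It is only an illustration under some assumptions: the function names (`similar`, `build_pan_genome`, `classify`) and the 0.8 cutoff are made up for this example, and `difflib.SequenceMatcher` is a toy stand-in for a real alignment hit (for instance a BLAST hit above your chosen identity and coverage cutoffs). It records which strains hit each database entry, so a gene counts as unique when only one strain ever matched it.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Toy stand-in for an alignment hit; swap in a real BLAST-based test."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= threshold


def build_pan_genome(strains, threshold=0.8):
    """strains: dict mapping strain name -> list of gene sequences (assumed input shape)."""
    representatives = []  # the growing "database" of sufficiently different genes
    hits = []             # hits[i] = set of strains that matched representatives[i]
    for strain, genes in strains.items():
        for gene in genes:
            for i, rep in enumerate(representatives):
                if similar(gene, rep, threshold):
                    hits[i].add(strain)  # already in the database: record the hit
                    break
            else:
                # No existing entry is similar enough, so add the gene as a new entry.
                representatives.append(gene)
                hits.append({strain})
    return representatives, hits


def classify(representatives, hits, n_strains):
    """Split database entries into core, accessory and unique genes."""
    groups = {"core": [], "accessory": [], "unique": []}
    for rep, strains_hit in zip(representatives, hits):
        if len(strains_hit) == n_strains:
            groups["core"].append(rep)       # hit for every strain
        elif len(strains_hit) == 1:
            groups["unique"].append(rep)     # seen in a single strain only
        else:
            groups["accessory"].append(rep)  # everything in between
    return groups


if __name__ == "__main__":
    # Made-up toy sequences, not real genes.
    toy = {
        "strainA": ["A" * 20, "C" * 20, "G" * 20],
        "strainB": ["A" * 19 + "C", "C" * 20],
        "strainC": ["A" * 20, "T" * 20],
    }
    reps, hits = build_pan_genome(toy)
    print(classify(reps, hits, n_strains=len(toy)))
```

With this scheme each gene is compared only against the current database rather than against every gene of every other genome. The result can depend on the order in which strains are processed and on the similarity cutoff, so treat it as a starting point rather than a finished pipeline.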

You might also want to consult what the literature says:

The Pangenome Structure of Escherichia coli: Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates