Strategy for generating a consensus sequence for 100 complete bacterial genomes?
6.3 years ago

Greetings,

I am working with 100 complete M. tuberculosis genomes in FASTA format. I want to align all the sequences and search for genomic regions common to all the strains. MAUVE was the only program I found that could handle a data set this large. Any ideas on how to generate a consensus sequence from the common regions that MAUVE reports? Is there any other program that can handle this much data and produce a consensus sequence? I tried PhyDE, but MUSCLE could only align a tiny initial portion of the genome. PhyDE would have been ideal, since it can both align and build a consensus sequence, but it does not work even with two whole genomes.

Appreciate the attention.

consensus seq mauve • 2.8k views

Are you looking to create a pan genome sequence (consensus sequence may not make sense since the strains could differ significantly)? If so here is a list of software from OmicTools.


Hi,

At first I thought the sequences were considerably different, but after running Gegenees, the heatmap showed a similarity of 99%+ between all strains. Constructing a pan-genome sequence is not a bad idea, but I am worried that doing so could exclude intergenic regions that might be interesting. I forgot to mention that the main objective is primer design.


So would it make sense to make more focused regional comparisons (rather than trying to create a general consensus) to assist with primer design?


Yes, a more focused comparison would be ideal in terms of both time and computational power. The problem is that I do not currently know which genomic regions to consider, since I can't see what is common and what is not. Gegenees only generates the heatmap; it does not tell me specifically what is shared or different. Maybe there is another way to do it, but I am not seeing it. The only solution I've come up with is to make a whole-genome alignment to see what is common, then build a consensus sequence and run the primer design on it. MAUVE does align blocks of the genomes, but the aligned blocks still differ over short stretches, so I may need to check and select regions manually.
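
To make the "align, then build a consensus, then look for common regions" idea concrete, here is a minimal Python sketch. It assumes the sequences of one aligned block have already been exported as equal-length strings with `-` for gaps (how to get them out of MAUVE is not covered here). The function names `consensus` and `conserved_windows` and the `min_len` parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of equal-length aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    # For every alignment column, keep the most frequent character.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def conserved_windows(aligned, min_len=20, gap="-"):
    """Return half-open (start, end) intervals where all sequences are
    identical and gap-free for at least min_len consecutive columns."""
    windows = []
    start = None
    for i, col in enumerate(zip(*aligned)):
        identical = len(set(col)) == 1 and col[0] != gap
        if identical:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                windows.append((start, i))
            start = None
    # Close a window that runs to the end of the alignment.
    length = len(aligned[0]) if aligned else 0
    if start is not None and length - start >= min_len:
        windows.append((start, length))
    return windows


if __name__ == "__main__":
    block = ["ACGTACGTAA", "ACGTACGAAA", "ACGTACGTAA"]
    print(consensus(block))
    print(conserved_windows(block, min_len=3))
```

For primer design, the fully conserved windows are probably more useful than the consensus itself, because a majority-rule consensus hides the positions where a minority of strains differ.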


That is why I was suggesting pan-genome tools, for example Panseq:

Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions.

While you don't need the other features, the fact that it extracts regions unique to a genome or group of genomes and identifies SNPs within shared core genomic regions should get you started.


Nice. When you said pan-genome analysis, I thought it would only consider genes and exclude intergenic regions, but the description of Panseq says "core and accessory regions", sorry. I will read the Panseq documentation, see whether it suits the purpose, and test it if so. I'll let you know whether it works. Many thanks!


This was just one example. There are other similar programs so take a wider look. Good luck.


Just one question: what does the output of Panseq look like?


Look for ##Description of output files on the page linked above.


Hi,

I've been testing Panseq since yesterday and got some results today. I noticed that Panseq creates some PHYLIP files. I have never used PHYLIP before; which PHYLIP program should I use to open the "binary.phylip" and "snp.phylip" files?


If you are not interested in phylogenetic relationships then you could safely ignore phylip files. PHYLIP is not the easiest program to use but you can find a guide here.


I see. I'm sorry for asking so many questions; as you know, I am new to Panseq. If I have interpreted it correctly, the "binary_table.txt" file shows the pan-genome and, for each genomic fragment, the strains in which it is present or absent, right? So if I choose the fragments for which every strain has a "1", they are in theory present in all strains and therefore part of the "core genome", right? Now, about the "coreGenomeFragments.fasta" file: it lists the fragments that belong to the "core genome". I manually checked some of them, and apparently some are not present in all strains even though the program says they are. Is that normal?
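
The "pick the fragments where every strain has a 1" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the table is tab-delimited, has one header row, and uses the first column as the fragment identifier with one 0/1 column per strain. Check these assumptions against the actual "binary_table.txt" before relying on it. The function name `core_fragments` is hypothetical.

```python
import csv
import io


def core_fragments(handle):
    """Return (strain_names, ids of fragments present in every strain).

    Assumed layout (not verified against Panseq): tab-delimited, header
    row first, fragment ID in column 0, then one 0/1 column per strain.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    strains = header[1:]
    core = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        calls = row[1:]
        # Keep only complete rows in which every strain is marked "1".
        if len(calls) == len(strains) and all(c == "1" for c in calls):
            core.append(row[0])
    return strains, core


if __name__ == "__main__":
    demo = "locus\tA\tB\tC\nf1\t1\t1\t1\nf2\t1\t0\t1\nf3\t1\t1\t1\n"
    strains, core = core_fragments(io.StringIO(demo))
    print(strains, core)
```

With a real file, the same function can be called as `core_fragments(open("binary_table.txt", newline=""))`. Comparing its result with the identifiers in "coreGenomeFragments.fasta" is one way to see exactly which fragments the two files disagree on.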


@Alec: I am sorry, but I can't help you with this. I have not used Panseq myself; my suggestion was based on your requirements.

Perhaps someone else will come along. You could also create a new post with this question.


Many thanks @genomax ! You've helped me a lot just by suggesting the program. I will post another question about this.


I forgot to ask, but is this difference related to the "percentIdentityCutoff" value that we configure in the settings.txt file? For example, if I choose a value of 100, will it print out only exact sequence matches across all strains?
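
For readers unfamiliar with identity cutoffs in general, the sketch below shows what such a threshold means for two already-aligned, equal-length sequences. It is not Panseq's implementation and does not answer how Panseq applies "percentIdentityCutoff" internally; the functions `percent_identity` and `passes_cutoff` are hypothetical names used only for illustration.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def passes_cutoff(a, b, cutoff):
    """True if the pair reaches the given percent-identity threshold."""
    return percent_identity(a, b) >= cutoff


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    hit = "ACGTACGTAA"  # one mismatch in ten positions
    print(percent_identity(ref, hit))
    print(passes_cutoff(ref, hit, 85))
    print(passes_cutoff(ref, hit, 100))
```

Under this simple definition, a cutoff below 100 lets a fragment count as "present" in a strain even when that strain's copy carries a few mismatches, which is one possible reason a manual exact-match search could fail to find a fragment reported as core.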
