Question: Strategy for generating a consensus sequence for 100 complete bacterial genomes?
16 months ago, Alec Watanabe wrote:

Greetings,

I am working with 100 complete M. tuberculosis genomes in FASTA format. I want to align all of the sequences to find genomic regions common to all strains. MAUVE was the only program I found that could handle a data set this large. Any ideas on how to generate a consensus sequence from the common regions that MAUVE reports? Is there another program that can handle this much data and produce a consensus sequence? I tried PhyDE, but MUSCLE could only align a small initial portion of the genome. PhyDE would have been ideal, since it can both align and build a consensus sequence, but it does not work even with two whole genomes.

Appreciate the attention.

mauve consensus seq • 574 views

Are you looking to create a pan-genome sequence? (A consensus sequence may not make sense, since the strains could differ significantly.) If so, here is a list of software from OmicTools.

modified 16 months ago • written 16 months ago by genomax

Hi,

At first I thought the sequences were considerably different, but after I ran Gegenees, the heatmap showed a similarity of 99%+ between all strains. Constructing a pan-genome sequence is not a bad idea, but I am worried that doing so could exclude intergenic regions that might be interesting. I forgot to mention that the main objective is primer design.

written 16 months ago by Alec Watanabe

So would it make sense to make more focused regional comparisons (rather than trying to create a general consensus) to assist with primer design?

written 16 months ago by genomax

Yes, a more focused comparison would be ideal in terms of both time and computational power. The problem is that I do not currently know which genomic regions to consider, since I can't see what is common and what is not. Gegenees only generates the heatmap; it does not say what specifically is common or different. Maybe there is another way to do it, but I am not seeing it. The only solution I've come up with is to make a whole-genome alignment to see what is common, then build a consensus sequence from it and run the primer design. Although MAUVE aligns blocks, the aligned blocks still differ over short stretches, so I may need to check and select regions manually.
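
Once an aligned block is available, the manual "check and select" step can be partly automated by scanning for runs of columns where every strain carries the same base. The following is only a minimal sketch under stated assumptions: the aligned sequences have already been read into equal-length Python strings with `-` for gaps (for example from one exported alignment block), and the function name `conserved_windows` and the `min_len` parameter are hypothetical, not part of MAUVE or any other tool mentioned here.

```python
from typing import List, Tuple


def conserved_windows(aligned: List[str], min_len: int = 20) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges in which every aligned
    sequence carries the same unambiguous base (no gaps, no N)."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    windows = []
    start = None  # first column of the current identical run, if any
    for i in range(length):
        column = {seq[i].upper() for seq in aligned}
        identical = len(column) == 1 and "-" not in column and "N" not in column
        if identical:
            if start is None:
                start = i
        else:
            # The run (if any) ends just before this column.
            if start is not None and i - start >= min_len:
                windows.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_len:
        windows.append((start, length))
    return windows


# Toy alignment (hypothetical data): one SNP at column 6, one indel at column 10.
toy = [
    "ACGTACGTAC-GTT",
    "ACGTACGTACAGTT",
    "ACGTACCTAC-GTT",
]
windows = conserved_windows(toy, min_len=3)
print(windows)
```

For primer design, `min_len` would be set to at least the intended primer length, so that each reported window can hold a primer that matches every strain exactly.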

modified 16 months ago • written 16 months ago by Alec Watanabe

That is why I was suggesting pan-genome tools, for example Panseq:

Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions.

You don't need the other features, but if it extracts regions unique to a genome or group of genomes and identifies SNPs within shared core genomic regions, that should get you started.

modified 16 months ago • written 16 months ago by genomax

Nice. When you said pan-genome analysis, I thought it would consider only genes and exclude intergenic regions, but the description of Panseq says "core and accessory regions"; sorry. I will read the Panseq documentation to see whether it suits the purpose, and test it if so. I'll let you know whether the software works. Many thanks!

written 16 months ago by Alec Watanabe

This was just one example. There are other similar programs so take a wider look. Good luck.

written 16 months ago by genomax

Just one question: what does the Panseq output file look like?

written 16 months ago by Alec Watanabe

Look for ##Description of output files on the page linked above.

written 16 months ago by genomax

Hi,

I've been testing Panseq since yesterday and got some results today. I noticed that Panseq creates some PHYLIP files. I have never used PHYLIP before. Which PHYLIP program should I use to open the "binary.phylip" and "snp.phylip" files?

written 16 months ago by Alec Watanabe

If you are not interested in phylogenetic relationships, you can safely ignore the PHYLIP files. PHYLIP is not the easiest program to use, but you can find a guide here.

modified 16 months ago • written 16 months ago by genomax

I see. I'm sorry for asking so many questions; as you know, I am new to Panseq. If I interpreted it correctly, the "binary_table.txt" file shows the pan-genome and indicates in which strains each genomic fragment is present or absent, right? So if I select the fragments for which every strain has a "1", those fragments are, in theory, present in all strains and therefore part of the "core genome", right? As for the "coreGenomeFragments.fasta" file, it lists the fragments that are in the "core genome". I manually checked some fragments, and some of them appear not to be present in all strains even though the program says they are. Is that normal?
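
The "every strain has a 1" selection described above can be sketched in a few lines of Python. This is a rough illustration, not Panseq's own logic, and it rests on an assumption about the file layout that should be checked against the real "binary_table.txt": tab-delimited text, a header row whose first cell is a label and whose remaining cells are strain names, then one row per fragment with 1/0 calls. The names `core_loci` and `strains_missing` are hypothetical helpers.

```python
import csv
import io
from typing import Dict, List


def read_table(table_text: str) -> Dict[str, Dict[str, str]]:
    """Parse an ASSUMED tab-delimited presence/absence table into
    {fragment_name: {strain_name: call}}."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    strains = header[1:]  # first header cell is assumed to be a label
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(strains, row[1:]))
    return table


def core_loci(table: Dict[str, Dict[str, str]]) -> List[str]:
    """Fragments called present ("1") in every strain."""
    return [name for name, calls in table.items()
            if calls and all(call == "1" for call in calls.values())]


def strains_missing(table: Dict[str, Dict[str, str]], fragment: str) -> List[str]:
    """Strains in which a given fragment is not called present."""
    return [strain for strain, call in table[fragment].items() if call != "1"]


# Toy table (hypothetical data) standing in for the real file contents.
toy_table = (
    "locus\tstrainA\tstrainB\tstrainC\n"
    "frag1\t1\t1\t1\n"
    "frag2\t1\t0\t1\n"
    "frag3\t1\t1\t1\n"
)
table = read_table(toy_table)
core = core_loci(table)
missing = strains_missing(table, "frag2")
print(core, missing)
```

Comparing the fragment names returned by `core_loci` with the headers in "coreGenomeFragments.fasta" would show whether the two files agree, which helps narrow down where a fragment that looks absent in a manual check is coming from.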

written 16 months ago by Alec Watanabe

@Alec: I am sorry, but I can't help you with this. I have not used Panseq myself. My suggestion was based on your requirement.

Perhaps someone else who has will be along. You could also create a new post with this question.

written 16 months ago by genomax

Many thanks @genomax ! You've helped me a lot just by suggesting the program. I will post another question about this.

written 16 months ago by Alec Watanabe

I forgot to ask: is this difference related to the "percentIdentityCutoff" value that we configure in the settings.txt file? For example, if I choose a value of 100, will it output only sequences that match exactly across all strains?
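
To make the idea of an identity cutoff concrete, here is a minimal sketch of how a percent-identity threshold behaves on two already-aligned sequences. It does not reproduce how Panseq computes or applies "percentIdentityCutoff"; that should be confirmed in the Panseq documentation. The definition used here (matching columns divided by compared columns, ignoring columns where both sequences have a gap) is an assumption chosen for illustration, and `percent_identity` and `passes_cutoff` are hypothetical names.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of compared alignment columns in which a and b agree.
    Columns where both sequences have a gap are skipped (an assumed convention)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_cutoff(a: str, b: str, cutoff: float) -> bool:
    """True if the pair meets or exceeds the identity cutoff (in percent)."""
    return percent_identity(a, b) >= cutoff


# Toy pair (hypothetical data): 10 columns, one mismatch at the last position.
ref = "ACGTACGTAC"
alt = "ACGTACGTAT"
identity = percent_identity(ref, alt)
print(identity, passes_cutoff(ref, alt, 90), passes_cutoff(ref, alt, 100))
```

Under this definition a cutoff of 100 accepts only pairs with no mismatches in the compared columns, while a lower cutoff lets fragments with a few differences count as shared, which is one way a "core" fragment could fail an exact-match manual check.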

written 16 months ago by Alec Watanabe