Greetings,
I am working with 100 complete M. tuberculosis genomes in FASTA format. What I want to do is align all the sequences to search for genomic regions common to all the strains. MAUVE was the only program I found that could handle a data set this big. Any ideas on how to generate a consensus sequence from these common genomic regions in MAUVE? Is there any other program that can handle this much data and make a consensus sequence? I tried PhyDE, but MUSCLE could only align a tiny initial portion of the genome. PhyDE would have been ideal since it can both align and make a consensus sequence, but it does not even work with two whole genomes.
Appreciate the attention.
Are you looking to create a pan genome sequence (consensus sequence may not make sense since the strains could differ significantly)? If so here is a list of software from OmicTools.
Hi,
At first I thought the sequences were considerably different, but after running Gegenees, the heatmap showed a similarity of 99%+ between all strains. Constructing a pan genome sequence is not a bad idea, but I am worried that by doing so I could end up excluding intergenic portions that might be interesting. I forgot to mention that the main objective is to run a primer design.
So would it make sense to make more focused regional comparisons (rather than trying to create a general consensus) to assist with primer design?
Yes, a more focused comparison would be ideal for both time and computational power. The problem is that I currently do not know which genomic regions to consider, since I can't see what is common and what is not. Gegenees only generates the heatmap and does not say what specifically is common or different. Maybe there is another way to do it, but I am not seeing it. The only solution I've come up with is to make a whole genome alignment to see what's common. After the alignment, I would make a consensus sequence and run the primer design. MAUVE does align blocks, but the aligned blocks still differ over short stretches, so I may need to check and select regions manually.
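To make the "align, then build a consensus, then design primers" idea concrete, here is a minimal Python sketch. It assumes you have already extracted one aligned block as a list of equal-length, gapped sequences (how you export that from MAUVE is not covered here). The function names `consensus` and `conserved_runs` and the `min_len` default are hypothetical, chosen only for illustration. Columns where the strains disagree are masked with N, so the remaining N-free stretches are candidate regions for primers.

```python
from collections import Counter


def consensus(aligned, min_fraction=1.0, ambiguous="N"):
    """Column-wise consensus of equal-length aligned sequences.

    A column keeps its most common base only if that base is not a gap
    and at least `min_fraction` of the sequences carry it.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for column in zip(*aligned):
        base, count = Counter(b.upper() for b in column).most_common(1)[0]
        if base != "-" and count / len(aligned) >= min_fraction:
            out.append(base)
        else:
            out.append(ambiguous)
    return "".join(out)


def conserved_runs(cons, min_len=18, ambiguous="N"):
    """Yield (start, end) for stretches without ambiguous positions."""
    start = None
    # The appended sentinel closes a run that reaches the end of the sequence.
    for i, ch in enumerate(cons + ambiguous):
        if ch != ambiguous:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                yield (start, i)
            start = None


# Toy alignment standing in for one aligned block
block = ["ACGTACGT", "ACGTACGT", "ACCTACGT"]
cons = consensus(block)
print(cons)
print(list(conserved_runs(cons, min_len=3)))
```

With `min_fraction=1.0` only columns identical in every strain survive, which is the strict choice for primer sites; lowering it gives a majority-rule consensus instead.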
That is why I was suggesting pan genome tools such as Panseq. While you don't need the other features, if it "extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions" then that should get you started.
Nice, I thought when you said pan genomic analysis, it would only consider genes and exclude intergenic regions, but the description of Panseq says "core and accessory regions", sorry. I will read the documentation of Panseq, see if it suits the purpose, and test it if so. I'll let you know whether the software works. Many thanks!
This was just one example. There are other similar programs so take a wider look. Good luck.
Just one question: what does the output of Panseq look like?
Look for "##Description of output files" on the page linked above.
Hi,
I've been testing Panseq since yesterday and got some results today. I noticed that Panseq creates some PHYLIP files. I have never used PHYLIP before. Which PHYLIP program should I use to open the "binary.phylip" and "snp.phylip" files?
If you are not interested in phylogenetic relationships then you could safely ignore phylip files. PHYLIP is not the easiest program to use but you can find a guide here.
I see. I'm sorry for asking so many questions. I am new to Panseq, as you know. If I interpreted it correctly, the "binary_table.txt" file shows the pan-genome and also in which strains each genomic fragment is present or absent, right? So, if I choose the fragments for which every strain has a "1", those fragments are theoretically present in all strains and therefore belong to the "core genome", right? Now, about the "coreGenomeFragments.fasta" file: it lists the fragments that are present in the "core genome". I manually checked some fragments and apparently some of them are not present in all strains, even though the program says so. Is that normal?
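For the "choose the fragments where every strain has a 1" step, a small script saves doing it by eye. This is only a sketch under assumptions: it assumes the table is tab-delimited, with a header row of strain names and the fragment identifier in the first column, so check that against your actual "binary_table.txt" before relying on it. The function name `core_loci` is hypothetical.

```python
import csv
import io


def core_loci(handle, present="1"):
    """Return IDs of rows in which every strain column equals `present`.

    Assumed layout (verify against the real file): tab-delimited,
    first row = header, first column = fragment/locus identifier.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_strains = len(header) - 1
    core = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        locus, calls = row[0], row[1:]
        if len(calls) == n_strains and all(c.strip() == present for c in calls):
            core.append(locus)
    return core


# Toy table standing in for binary_table.txt
example = (
    "locus\tstrainA\tstrainB\tstrainC\n"
    "frag1\t1\t1\t1\n"
    "frag2\t1\t0\t1\n"
    "frag3\t1\t1\t1\n"
)
print(core_loci(io.StringIO(example)))  # ['frag1', 'frag3']
```

For a real run, replace the `io.StringIO(example)` with `open("binary_table.txt", newline="")`.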
@Alec: I am sorry but I can't help you with this. I have not used Panseq myself. My suggestion was based on your requirement. Perhaps someone else will be along. You could also create a new post with this question.
Many thanks @genomax ! You've helped me a lot just by suggesting the program. I will post another question about this.
I forgot to ask, but is this difference related to the "percentIdentityCutoff" value that we configure in the settings.txt file? For example, if I choose a value of 100, will it print out only exact sequence matches across all strains?
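One way to test this yourself is to compare an exact-match search with a percent identity calculation for the same fragment. The sketch below is a toy illustration, not Panseq's implementation, and it makes no claim about how Panseq applies "percentIdentityCutoff" internally. The function names are hypothetical, and the genomes are assumed to be already loaded as plain strings keyed by strain name.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_presence(fragment, genomes):
    """Map strain name -> True if the fragment occurs exactly on either strand.

    A fragment spanning the start/end of a circular genome is not detected.
    """
    frag = fragment.upper()
    rc = reverse_complement(frag)
    result = {}
    for name, seq in genomes.items():
        seq = seq.upper()
        result[name] = frag in seq or rc in seq
    return result


def percent_identity(a, b):
    """Percent of matching positions between two equal-length aligned strings."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Toy genomes: s1 has the fragment, s2 has one mismatch, s3 has it on the other strand
genomes = {"s1": "TTACGTAGG", "s2": "TTACCTAGG", "s3": "CCTACGTAA"}
print(exact_presence("ACGTAG", genomes))
print(percent_identity("ACGTAG", "ACCTAG"))
```

In the toy data, s2 fails the exact search even though its copy of the fragment is about 83% identical. If a fragment reported as core fails an exact search in some of your strains, a near-identical copy like this is one possible explanation worth checking.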