Question

How to construct phylogeny based on multiple sequence alignment of orthologs without assembling the genomes

7.7 years ago

abbhinay • 0

I have two phylogenies:

1) Species phylogeny (in black): Species B to D have published genomes, and I have assembled a genome for Species A. I constructed the phylogeny from a multiple sequence alignment of protein orthologs across Species A to D (OrthoMCL -> MUSCLE -> trimAl -> MrBayes).

2) Subspecies phylogeny (in red): I also have sequencing data for different subspecies and isolates of Species A. I mapped these onto the Species A genome, identified SNPs (using GATK) and built a SNP-based phylogeny.

My question now is "what is the best way to integrate both these phylogenies into one?".

I do not want to assemble the genomes for all the subspecies (tedious for 20 isolates), and I do not want to map the Species B-D reads onto Species A (they are very divergent, so I think inferring relationships through an MSA is best).

I can infer the nucleic acid/protein sequences of the subspecies' orthologs from the variant calls and add them to the multiple sequence alignment used for the species phylogeny. But I find the output of tools like vcf2fq and FastaAlternateReferenceMaker complicated (see the thread "New Fasta Sequence From Reference Fasta And Variant Calls File?"). In this case, how should I deal with SNPs in repetitive regions, which we usually exclude from analysis?
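To make the idea concrete, here is a minimal Python sketch of inferring an isolate's sequence from SNP calls. It is not how vcf2fq or FastaAlternateReferenceMaker work internally; the function name `apply_snps`, the tuple layout `(pos, ref, alt)` and the toy sequence are assumptions made up for illustration. It handles substitutions only (no indels), uses 1-based positions as in VCF, and writes `N` at positions that should not be trusted, such as those in repetitive regions.

```python
def apply_snps(reference, snps, masked=()):
    """Return `reference` with SNPs substituted and masked positions set to 'N'.

    reference: reference sequence of one gene or contig (str)
    snps:      iterable of (pos, ref_base, alt_base), pos is 1-based as in VCF
    masked:    iterable of 1-based positions to overwrite with 'N'
    """
    seq = list(reference.upper())
    for pos, ref_base, alt_base in snps:
        # Only single-base substitutions are supported in this sketch.
        if len(ref_base) != 1 or len(alt_base) != 1:
            raise ValueError(f"not a SNP at position {pos}")
        # Guard against coordinates that do not match the reference.
        if seq[pos - 1] != ref_base.upper():
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt_base.upper()
    # Masking is applied last so an untrusted SNP never survives.
    for pos in masked:
        seq[pos - 1] = "N"
    return "".join(seq)


# Hypothetical toy data: one short gene and two SNP calls for one isolate.
reference = "ATGGCCAAGT"
isolate_snps = [(4, "G", "A"), (9, "G", "C")]
consensus = apply_snps(reference, isolate_snps, masked=[6])
print(consensus)  # ATGACNAACT
```

Because only substitutions are applied, the inferred sequence has the same length and coordinates as the reference gene, which is what allows it to be placed alongside the reference sequence in a nucleotide alignment.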

Is there any other way to achieve this?

[Figure: example phylogeny]

SNP alignment phylogeny • 2.6k views

"assemble the genomes ... tedious for 20 isolates"

"I find the output of tools like vcf2fq and FastaAlternateReferenceMaker complicated"

Which approach is more efficient may depend on genome size and ploidy. For bacteria I would recommend assembling the reads de novo with SPAdes, which is fast and very easy to use. For bacteria, de novo assembly is not at all "tedious".

ADD REPLY • link 7.7 years ago by piet ★ 1.8k

The genome size is 20 Mb and the organism is haploid, so de novo assembly is tedious (ordering, filling gaps, annotating genes).

ADD REPLY • link 7.7 years ago by abbhinay • 0

"how to deal with SNPs in repetitive regions"

You should not base a phylogeny on repetitive regions. Repeats are formed by recombination, and recombination events will distort the phylogenetic signal.

ADD REPLY • link 7.7 years ago by piet ★ 1.8k

In addition, highly repetitive regions are prone to sequencing errors, and thus to unreliable variant calls.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

Thanks @piet @WouterDeCoster. Will keep that in mind! For now, I have discarded all SNPs in DustMasker-predicted regions.
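A filtering step like this could look roughly like the Python sketch below. It does not parse DustMasker output: the intervals are assumed to be already loaded as 1-based, inclusive `(start, end)` pairs on a single contig, and the names `build_index`, `in_masked` and `filter_snps`, as well as the example coordinates, are hypothetical.

```python
from bisect import bisect_right


def build_index(intervals):
    """Sort and merge 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        # Merge intervals that overlap or directly touch the previous one.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]
    ends = [e for _, e in merged]
    return starts, ends


def in_masked(pos, starts, ends):
    """True if `pos` lies inside one of the merged intervals."""
    # Find the last interval that starts at or before `pos`.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos <= ends[i]


def filter_snps(snp_positions, intervals):
    """Keep only SNP positions that fall outside all masked intervals."""
    starts, ends = build_index(intervals)
    return [p for p in snp_positions if not in_masked(p, starts, ends)]


# Hypothetical low-complexity intervals and SNP positions on one contig.
low_complexity = [(100, 150), (140, 200), (500, 520)]
snps = [42, 100, 175, 201, 510, 900]
kept = filter_snps(snps, low_complexity)
print(kept)  # [42, 201, 900]
```

Merging the intervals first means each lookup is a single binary search, which stays cheap even with many masked regions and many SNPs per isolate.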

ADD REPLY • link 7.7 years ago by abbhinay • 0