Question

Similar binned bacterial genomes don't cluster together in dendrogram/phylogenetic tree in R

1

5.5 years ago

manpy64 ▴ 10

Dear all, I am a rookie in data analysis and I am stuck with my results; I don't know how to interpret them.

I started with 7 metagenomic assemblies from different species of Azolla fern. The aim was to identify bacteria in the leaf ecosystem of the different Azolla species. Our hypothesis was that if similar bacteria recur across the different Azolla species, they will cluster together when their genomes are plotted in a dendrogram or a tree.

The method: SPAdes was used to produce the assemblies, BWA for backmapping, samtools for sorting, MetaBAT for binning, and CheckM to check the completeness and contamination of the bins.

Then Prokka was used to annotate the genomes, the UniProt IDs were extracted, and a table was made of all UniProt IDs across all the bins. The table was converted to a binary table and then used to create a dendrogram in R.
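For readers following along, the binary-table step can be sketched as below. This is a minimal illustration in Python rather than the R code actually used, and it assumes a Jaccard distance between bins; the post does not say which distance measure was used, and R's `dist()` defaults to Euclidean. The bin names and IDs are placeholders, not real data.

```python
from itertools import combinations

# Hypothetical toy input: bin name -> set of annotation IDs found in that bin.
# "U0001" etc. are placeholders, not real UniProt accessions.
bin_ids = {
    "sampleA_bin1": {"U0001", "U0002", "U0003"},
    "sampleA_bin2": {"U0001", "U0002", "U0004"},
    "sampleB_bin1": {"U0001", "U0003", "U0005"},
}

def binary_table(bin_ids):
    """Return (sorted ID list, {bin: [0/1, ...]}) presence/absence rows."""
    all_ids = sorted(set().union(*bin_ids.values()))
    rows = {
        name: [1 if uid in ids else 0 for uid in all_ids]
        for name, ids in bin_ids.items()
    }
    return all_ids, rows

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

all_ids, rows = binary_table(bin_ids)

# Pairwise distances between bins, keyed by (bin_x, bin_y).
distances = {
    (x, y): jaccard_distance(bin_ids[x], bin_ids[y])
    for x, y in combinations(sorted(bin_ids), 2)
}

for (x, y), d in distances.items():
    print(f"{x}\t{y}\t{d:.2f}")
```

A distance table like this is what a hierarchical clustering step would take as input. Note that with presence/absence data, anything that adds the same IDs to every bin from one sample (for example contaminating contigs) will pull those bins together.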

The dendrogram was then turned into a tree in FigTree. In the tree I observed that the bacteria cluster according to the metagenomic sample or plant host, not according to their taxonomic names; e.g. Rhizobiales clusters with Burkholderiales from the same metagenomic assembly, but not with Rhizobiales from another host plant's assembly.

I'm at a dead end on how to interpret these results and what I can deduce from them. Are there other ways to improve my approach? Can I directly compare bins with similar taxonomy from different metagenomic assemblies? Any suggestions will be valuable.

Kind regards,

manpy, student, Utrecht University, Holland

Assembly gene next-gen alignment SNP • 1.4k views

ADD COMMENT • link updated 5.5 years ago by Joe 21k • written 5.5 years ago by manpy64 ▴ 10

1

Please take a moment to come up with a short title that is also informative/on point. Putting your actual question in the title is not a good practice.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

OK sir, I will try to come up with a shorter question.

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

1

Can you clarify if fern sequences were excluded (either during library prep stage or by informatics afterwards)? Did you get an equivalent amount of data from all samples?

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

Yes sir, we chose to filter plant DNA out of the sequencing data, since we were only interested in microbial DNA. Plant DNA was removed by mapping (aligning) the reads to a reference plant genome; only the reads that did not map to anything were kept for further analysis. And yes, we got roughly equal amounts of data from the samples.

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

2

Thanks for clarifying that. Can you also comment on what happened to the sequences quantitatively? How much data did you lose and was an equivalent amount left over for every sample?

Perhaps you don't have enough data (there is no guarantee that bacterial genomes were fully sampled or you did not lose useful data during the filtering) to draw a useful conclusion.
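One way to put a number on this is to count reads before and after the plant-filtering step for each sample. The following is a minimal sketch only, assuming plain four-line FASTQ records (no wrapped sequence lines); the tiny in-memory "files" are made-up stand-ins, and for real data the handles would come from `open()` or `gzip.open(path, "rt")`.

```python
import io

def count_fastq_reads(handle):
    """Count reads in a FASTQ stream, assuming four lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4:
        raise ValueError(
            "line count is not a multiple of 4; file may be truncated or wrapped"
        )
    return n_lines // 4

def fraction_retained(raw_handle, filtered_handle):
    """Fraction of raw reads that survived filtering (0.0 if there were none)."""
    raw = count_fastq_reads(raw_handle)
    kept = count_fastq_reads(filtered_handle)
    return kept / raw if raw else 0.0

# Hypothetical reads standing in for the raw and plant-filtered FASTQ files.
raw_fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
    "@r4\nCCGG\n+\nIIII\n"
)
filtered_fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
    "@r4\nCCGG\n+\nIIII\n"
)

retained = fraction_retained(raw_fastq, filtered_fastq)
print(f"retained {retained:.0%} of reads")
```

Running this per sample gives a quick table of how much data was left for each host, which is the comparison being asked about here.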

BTW: There is no need to use honorifics. We are all fellow scientists.

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

We only used DNA extracted from the leaf pockets of these ferns, because they contain many symbiotic bacteria. The scaffolds had already been assembled; I started by finding out how abundant each scaffold is in the different metagenomic samples. I aligned the Illumina reads (FASTQ files) to the scaffolds (FASTA) of the assembly in a step called backmapping, using a tool called BWA (Burrows-Wheeler Aligner), and then sorting was done.

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

0

I do not know how to see how much data we lost. The assemblies had already been created by my supervisor; I started from the backmapping step.

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

score 3 · Accepted Answer · 2018-10-08

3

5.5 years ago

Joe 21k

An alternative approach might be to calculate mash distances between all your genomes directly. Assuming your taxonomic assignments are largely correct, this would save you having to mess about with extracting the uniprot IDs, binary matrix etc.

Since you can’t align that many full genomes, mash distances are a good surrogate for genome similarity.

You can then draw your trees using pairwise mash distances among all your genomes.

I’m not 100% sure how much you’ll need to polish your genomes etc. You may need to reorder contigs and perhaps concatenate but if you read up on it I’m sure it’ll become clear.
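As a rough sketch of the bookkeeping involved: `mash dist` writes one tab-separated row per genome pair (reference ID, query ID, distance, p-value, shared hashes), and those rows can be folded into a square distance matrix for tree building. The Python below assumes the distances were produced beforehand with commands along the lines shown in the comments; the file names and all numbers in the example are invented for illustration, not real Mash output.

```python
import csv
import io

# Assumed to have been generated beforehand with something like:
#   mash sketch -o bins A.fa B.fa C.fa
#   mash dist bins.msh bins.msh > distances.tab
# The rows below are made-up stand-ins for that file.
example_output = """\
A.fa\tA.fa\t0\t0\t1000/1000
A.fa\tB.fa\t0.1\t0\t100/1000
A.fa\tC.fa\t0.22\t0\t10/1000
B.fa\tB.fa\t0\t0\t1000/1000
B.fa\tC.fa\t0.3\t0\t2/1000
C.fa\tC.fa\t0\t0\t1000/1000
"""

def read_mash_dist(handle):
    """Parse tab-separated `mash dist` rows into (names, symmetric matrix)."""
    pairs = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        ref, query, dist = row[0], row[1], float(row[2])
        pairs[(ref, query)] = dist
    names = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for (ref, query), dist in pairs.items():
        i, j = index[ref], index[query]
        matrix[i][j] = matrix[j][i] = dist  # fill both halves
    return names, matrix

names, matrix = read_mash_dist(io.StringIO(example_output))
for name, row in zip(names, matrix):
    print(name, row)
```

The resulting matrix can be handed to whichever clustering or tree-building routine you prefer.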

ADD COMMENT • link 5.5 years ago by Joe 21k

0

Dear Healey, suppose my taxonomic assignment is not correct, and I think that comparing the genomes via lists of UniProt IDs is not a good approach. Can I still use mash distances to find the similarity between different genomes? Kind regards, manpy

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

1

Yes, the approach will still work, it just means that you won't be able to literally read out your results with 100% confidence at the end.

For a simple example, let's say you have 3 genomes, and you get all the mash distances (picking some meaningless example numbers):

       A
 0.1  / \  0.22
     B - C
      0.3

which would be roughly equivalent to the tree:

   +- A
+--|
|  +--- B
|
+---- C

But let's suppose B is incorrectly assigned. You might then go on to say that B should be closely related to A, and you end up putting the wrong label inside this cluster (for example, B might belong to one subspecies, and A and C to another).

In actual fact, the sequences likely are similar, as the mash distances suggest, but your taxonomic assignment is wrong. I don't think this is a significant issue however, as it should all be derived from the sequence itself through your binning and distance calculation. So, assuming the sequence isn't incorrect in some way, I'm sure you'll be fine.
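To make the three-genome example concrete, here is a minimal sketch that turns those same example distances into a Newick tree. It assumes UPGMA (average-linkage) clustering, which is just one simple choice and not something this thread prescribes; neighbour joining or another method may suit real data better.

```python
def upgma(names, dist):
    """Average-linkage (UPGMA) clustering; returns a Newick string.

    `dist` maps (name_a, name_b) -> distance for every pair of names.
    """
    # label -> (newick text, number of leaves, height of the node)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(k): v for k, v in dist.items()}
    while len(clusters) > 1:
        closest = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(closest)
        height = d[closest] / 2              # new node sits halfway
        (na, sa, ha) = clusters.pop(a)
        (nb, sb, hb) = clusters.pop(b)
        merged = f"({na}:{height - ha:.3f},{nb}:{height - hb:.3f})"
        label = f"{a}+{b}"
        # Distance to each remaining cluster is the size-weighted average.
        for other in clusters:
            da = d.pop(frozenset((a, other)))
            db = d.pop(frozenset((b, other)))
            d[frozenset((label, other))] = (sa * da + sb * db) / (sa + sb)
        del d[closest]
        clusters[label] = (merged, sa + sb, height)
    newick = next(iter(clusters.values()))[0]
    return newick + ";"

# The meaningless example numbers from the triangle above.
example = {("A", "B"): 0.1, ("A", "C"): 0.22, ("B", "C"): 0.3}
tree = upgma(["A", "B", "C"], example)
print(tree)  # ((A:0.050,B:0.050):0.080,C:0.130);
```

The output groups A with B and leaves C outside, matching the tree drawn above. The printed string can be pasted into a Newick viewer such as FigTree.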

ADD REPLY • link 5.5 years ago by Joe 21k

0

Dear Healey, thank you very much. My hypothesis was that the binned bacterial genomes with similar taxonomic names would cluster together in a tree. However, these similarly named bacterial genomes, which recur in the metagenomic assemblies of different Azolla species, do not cluster together; instead they cluster with bacterial genomes that have different taxonomic names but come from the same metagenomic assembly. Since I used the lists of UniProt IDs that repeat in every bacterial genome as input for my tree, this observation made me conclude that there are genes in these genomes that repeat within the metagenomic assembly of their specific host, but not between the metagenomic assemblies of different species. On this basis, I can say that my binning process was not correct, or that something else is going on. Healey, what do you think? I think mash distances will help me compare the genomes and then come to a viable conclusion. Kind regards, manpy

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10

1

I'm not sure I fully understand your UniProt-based approach, but my gut feeling is that it's too clumsy to be very accurate.

Essentially the task will reduce to you creating the clusters, and if you spot a taxonomic label that is out of place, you will need to go back and check whether it appears out of place because of an issue with the sequence, or because it is labelled wrong.
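A quick way to spot such out-of-place labels is to ask, for each bin, whether its nearest neighbour shares its taxonomic label or only its host sample. The sketch below is a hypothetical helper written for illustration; the bin names, labels, and distances are all invented, and were chosen so that bins group by host, as in the observation described in the question.

```python
def nearest_neighbour_report(names, matrix, taxon, host):
    """For each bin, report its closest other bin and what the two share."""
    report = []
    for i, name in enumerate(names):
        # Index of the closest bin other than the bin itself.
        j = min(
            (k for k in range(len(names)) if k != i),
            key=lambda k: matrix[i][k],
        )
        other = names[j]
        report.append(
            (name, other, taxon[name] == taxon[other], host[name] == host[other])
        )
    return report

# Invented toy data: two hosts, each with one bin per taxonomic order.
names = ["A_rhizo", "A_burk", "B_rhizo", "B_burk"]
taxon = {
    "A_rhizo": "Rhizobiales", "A_burk": "Burkholderiales",
    "B_rhizo": "Rhizobiales", "B_burk": "Burkholderiales",
}
host = {"A_rhizo": "A", "A_burk": "A", "B_rhizo": "B", "B_burk": "B"}
matrix = [
    [0.0, 0.2, 0.5, 0.6],
    [0.2, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.2],
    [0.6, 0.5, 0.2, 0.0],
]

report = nearest_neighbour_report(names, matrix, taxon, host)
for name, other, same_taxon, same_host in report:
    print(f"{name} -> {other}: same taxon={same_taxon}, same host={same_host}")
```

Bins whose nearest neighbour shares the host but not the taxon are the ones worth checking by hand, either for a sequence problem or for a wrong label.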

Looking for 'repeated' genes seems to me to be a red herring, as you may not know exactly what's caused that gene to repeat - it could be a misassembly. Within a metagenome, there are some genes which are bound to repeat. How many copies you have will depend on how many genomes were in your metagenome sample, not just how many times that gene is present within any single genome.

The approach with mash distances may not work, but it was what occurred to me first, and is loosely designed for this problem of comparing many whole genomes. If your assemblies are poor though, it may not perform optimally.

It shouldn't take too long to try at least.

ADD REPLY • link 5.5 years ago by Joe 21k

0

Healey, the UniProt ID approach was like this: after Prokka, I extracted the list of UniProt IDs for every bacterial genome, along with how many times each ID repeats in that genome. Then I joined the list for every bin to the list for its particular metagenomic assembly, and joined the lists of all metagenomic assemblies into one big table. I converted this table to a binary table in R, made a dendrogram, and then made a tree in FigTree. I hope you understand my approach now.

ADD REPLY • link 5.5 years ago by manpy64 ▴ 10