A pipeline for comparative bacterial genomics involving cohorts (i.e. between group comparisons)?
8.2 years ago
Nick ▴ 290

This is the original version of the question. Please jump to the edit below as it makes the question clearer.

I don't have a lot of experience with bacterial NGS but have recently started working on a project that explores the differences (e.g. SNPs, indels) between bacterial strains. My first idea was to take a reference and go for some sort of GATK-based pipeline: first align the Illumina reads for strain A against a reference and call the indels and SNPs, then do the same for strain B, and then compare the results. Then I realised that I might be taking an unnecessary detour here. Wouldn't it make sense to directly compare the raw reads of strain A to those of strain B and find the differences without aligning to an intermediate reference?

The closest equivalent setting I can think of would be in cancer genomics, i.e. comparing tumours to healthy cells. Would any of those cancer genomics pipelines be applicable?

EDIT

Judging from the responses, I haven't expressed my case very well. Here is my project:

  • I have a set of unknown strains of species A taken from, say, the oral cavities of a group of people. Each individual gives rise to a single sample. The bacterial species is cultured and then sequenced on a MiSeq.
  • Similarly, I have a set of unknown strains of the same species but taken from another body compartment, say the gut. These may (or may not) come from the same people as above.

The task is to find whether there is anything that systematically differs between the first and the second set.

I was first thinking of picking a reference and then doing alignment for each sample. But this poses the problem of picking the right reference. As we have dozens of samples, the best reference may not even be the same for every sample. Isn't there a way to compare the samples as sets so that you can say something along the lines of:

  • the first set lacks gene X

or

  • the first set has a particular mutation in a gene Y

In order to be able to make such statements, do I really need to perform alignment on each sample in isolation?

snp next-gen alignment • 3.0k views
8.2 years ago
DG 7.3k

I would say that no, the cancer genomics pipelines are probably not very directly applicable, at least not without a lot of fiddling. The biggest issue is that those pipelines are usually geared towards reference-based variant calling and highly optimized for human samples.

For bacteria you can do reference-based alignment, but how well it works depends on the particular strains, species, and genera you are working with. Individual bacterial strains can be as different from one another genetically as many species are, and this isn't only due to straightforward SNPs and indels but also to gene content differences. Presence/absence of particular genes and gene cassettes is a huge differentiating factor between bacterial strains. So you can do some reference-based comparisons, but most people would do de novo assembly followed by whole-genome comparisons and alignments. There are tons of tools out there for different parts of these analyses, so it is really worth researching what is available and deciding what you want to do. Some programs you may find useful:

A5 and the updated A5-miseq are supposed to be pretty good assembly pipelines, from the Darling Lab

MAUVE is a multiple genome aligner (not an assembler), pretty widely used, also from the Darling Lab

Velvet is another assembler

MUMMER is a whole genome aligner

BRIG is a tool for comparing genomes that creates circular comparison images

There are many other tools out there. I've only done a little bit of this myself, but these are all tools I have used or that came highly recommended.

Thanks, Dan. I've edited my question to make it more precise. Can you, perhaps, suggest a pipeline that would help me address this task?

Not beyond what I've already suggested really. I'm not an expert and have only dabbled in this area.

8.2 years ago
dago ★ 2.8k

As Dan Gaston said, the variability among bacteria can go well beyond SNPs and indels. A common theme in microbiology is horizontal gene transfer (HGT). Genes involved in virulence and antibiotic resistance are usually spread in this way. Closely related strains may therefore carry large insertions in their genomes that lead to different phenotypes. If you do not take that into account you will miss a big portion of the biological information. This is just one small example.

I think the best thing to do is to assemble your genomes and compare them with each other, starting with gene content (e.g. shared genes and unique ones).
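
To make the "shared genes and unique ones" idea concrete, here is a minimal Python sketch. It assumes you already have, for each assembled strain, a set of gene-family (ortholog group) IDs from whatever annotation and clustering step you choose. The strain names, family IDs, and the `partition_pangenome` helper are invented for illustration and are not the output of any particular tool.

```python
from collections import Counter


def partition_pangenome(strain_genes):
    """Split gene families into core, accessory, and strain-unique sets.

    strain_genes: dict mapping strain name -> set of gene-family IDs.
    Core = present in every strain; unique = present in exactly one
    strain (only meaningful with more than one strain); accessory =
    everything in between.
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family occurs.
    counts = Counter(fam for fams in strain_genes.values() for fam in fams)
    core = {fam for fam, c in counts.items() if c == n_strains}
    unique = {fam for fam, c in counts.items() if c == 1 and c < n_strains}
    accessory = set(counts) - core - unique
    return core, accessory, unique


# Hypothetical toy data: three strains with made-up family IDs.
strains = {
    "oral_1": {"famA", "famB", "famC"},
    "oral_2": {"famA", "famB", "famD"},
    "gut_1": {"famA", "famD", "famE"},
}

core, accessory, unique = partition_pangenome(strains)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
print("unique:", sorted(unique))
```

The hard part in practice is producing reliable gene-family assignments from draft assemblies; the set arithmetic itself is trivial once you have them.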

SPAdes is a really good assembler too.

Then you can move on to considering SNPs, dN/dS and so on.

Good luck

Thanks, dago. I've edited my question to make it more precise. Can you, perhaps, suggest a pipeline that would help me address this task?

From what I read, you have cultured the bacteria and you should have draft genomes. If that is the case, what I wrote above still holds. First, identify the orthologous genes amongst the strains; that starts to tell you what they share and what is unique to each. Without this step you cannot take the analysis any further: evolutionary rates, mutations and so on can only be inferred on orthologous genes. Mapping the reads to a reference does not make much sense to me, for the reasons I gave above. Of course, once you have identified the orthologs you can explore further and say, for example, that the bacteria from the mouth all share a specific group of genes that are not present in the ones from the gut, and so on.
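
Once ortholog presence/absence is known per strain, the mouth-versus-gut comparison could be sketched roughly as below. This is only an illustration under assumptions: the strain names, family IDs, and the helpers `fisher_exact_two_sided` and `compare_groups` are hypothetical, and Fisher's exact test is my choice of a simple association test, not something prescribed in this thread.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "present" calls in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1))
            if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def compare_groups(presence, group_a, group_b):
    """Test each gene family for association with group membership.

    presence: dict strain -> set of gene-family IDs.
    Returns (family, n_present_a, n_present_b, p_value), lowest p first.
    """
    families = set().union(*presence.values())
    results = []
    for fam in sorted(families):
        a_with = sum(fam in presence[s] for s in group_a)
        b_with = sum(fam in presence[s] for s in group_b)
        p = fisher_exact_two_sided(a_with, len(group_a) - a_with,
                                   b_with, len(group_b) - b_with)
        results.append((fam, a_with, b_with, p))
    return sorted(results, key=lambda r: r[3])


# Hypothetical toy data: four oral and four gut isolates.
oral = ["oral_1", "oral_2", "oral_3", "oral_4"]
gut = ["gut_1", "gut_2", "gut_3", "gut_4"]
presence = {s: {"famCore"} for s in oral + gut}
for s in oral:
    presence[s].add("famX")  # found only in the oral isolates
presence["oral_1"].add("famY")
presence["gut_1"].add("famY")

results = compare_groups(presence, oral, gut)
for fam, n_oral, n_gut, p in results:
    print(f"{fam}\toral={n_oral}/{len(oral)}\tgut={n_gut}/{len(gut)}\tp={p:.4f}")
```

With thousands of gene families the p-values need a multiple-testing correction, and closely related isolates are not independent observations, so treat the output as a list of candidates to follow up rather than a final result.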
