Phylogeny from mixed GBS and WGS
7 weeks ago
maxrwjones ▴ 60

Hi all,

I'm a PhD student with some experience working with transcriptomic and epigenomic data, but I'm new to phylogeny reconstruction. I've been tasked with making a phylogeny for a set of accessions of a crop species, but from what I understand the data types may not be compatible.

I have 233 accessions with ~14x whole-genome short-read sequencing data and, from a collaborator, 297 accessions with GBS data. More details:

  • The short-read data comprise ~60M Illumina 150 bp paired-end reads per accession.
  • The GBS data comes from a ddRAD-based library preparation using the enzymes PstI and NlaII. The digested DNA was adaptor-ligated and then PCR amplified.
  • The size of the reference genome is approximately 600 Mb.

I can easily derive a Variant Call Format (VCF) file for each data type, and I think these can be merged using "bcftools merge", though of course most SNPs detected from the WGS data will be missing in the GBS samples.
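To make the missing-data concern concrete, here is a minimal sketch of one possible filter: keep only sites with an adequate call rate within each platform separately, so that WGS-only sites do not dominate the matrix. It assumes genotypes have already been parsed from the merged VCF into alt-allele counts (0/1/2, with None for missing); the function names, the toy data, and the `min_rate` threshold are all illustrative assumptions, not recommended values.

```python
# Genotypes are coded as alt-allele counts (0/1/2), with None for a missing call.
# Rows are sites, columns are samples; `platform` labels each column.
# All names and thresholds here are hypothetical, for illustration only.

def call_rate(row, columns):
    """Fraction of the given sample columns with a non-missing genotype."""
    called = sum(1 for c in columns if row[c] is not None)
    return called / len(columns)


def sites_called_in_both(genotypes, platform, min_rate=0.8):
    """Return indices of sites whose call rate is >= min_rate in every platform."""
    groups = {}
    for col, label in enumerate(platform):
        groups.setdefault(label, []).append(col)
    keep = []
    for i, row in enumerate(genotypes):
        if all(call_rate(row, cols) >= min_rate for cols in groups.values()):
            keep.append(i)
    return keep


# Toy data: three WGS samples followed by two GBS samples.
platform = ["WGS", "WGS", "WGS", "GBS", "GBS"]
genotypes = [
    [0, 1, 2, 0, 1],        # called in every sample
    [0, 0, 1, None, None],  # WGS-only site
    [2, None, 2, 1, 1],     # one missing WGS call
]
kept = sites_called_in_both(genotypes, platform, min_rate=0.6)
print(kept)
```

Filtering per platform rather than on the overall call rate matters here because a site called in all 233 WGS samples and no GBS samples would still have a call rate above 40% overall.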

However, I have been advised by another student that this merged VCF may not yield a valid phylogeny (after filtering and conversion to e.g. PHYLIP) because the two genotyping methods have inherent detection biases that could distort the results. An extreme outcome could be that samples from the two methods falsely resolve into entirely separate clades.
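One rough way to check for this kind of platform-driven split is to compare genetic distances within and between platforms. The sketch below assumes each sample is a vector of alt-allele counts (0/1/2, None for missing) at shared sites; the distance measure (mean absolute genotype difference) and all names are illustrative assumptions.

```python
from itertools import combinations

# Each sample is a list of alt-allele counts (0/1/2) with None for missing calls.
# Function names and toy data are hypothetical.

def pairwise_distance(a, b):
    """Mean absolute genotype difference over sites called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # no overlapping calls, distance undefined
    return sum(abs(x - y) for x, y in shared) / len(shared)


def platform_separation(samples, platform):
    """Return (mean within-platform distance, mean between-platform distance)."""
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        d = pairwise_distance(samples[i], samples[j])
        if d is None:
            continue
        if platform[i] == platform[j]:
            within.append(d)
        else:
            between.append(d)
    mean_within = sum(within) / len(within) if within else None
    mean_between = sum(between) / len(between) if between else None
    return mean_within, mean_between


samples = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
]
platform = ["WGS", "WGS", "GBS", "GBS"]
mean_within, mean_between = platform_separation(samples, platform)
print(mean_within, mean_between)
```

A between-platform mean much larger than the within-platform mean is only a warning sign, not proof of bias: the two collections may also differ for genuine biological reasons.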

Is this something I should be worried about? If so, is there any set of computational corrections that can be applied to account for the biases of these different methods?

Many thanks, Max

VCF Phylogeny GBS WGS tree • 287 views
7 weeks ago
dthorbur ★ 1.9k

What is the purpose of this phylogeny? Do you need to use all variants detected across all samples? A phylogeny of 530 samples built from WGS/GBS data will carry an enormous computational overhead, even for neighbor-joining trees.

I agree with your colleague, though. GBS is a reduced-representation approach, so it differs significantly from WGS in which variants it can detect. Mixing the two is usually not a good idea without very careful consideration.

For the phylogeny, I would look at the distribution of coverage and find a handful of suitable loci with similar depth across both WGS and GBS samples. Ideally these will overlap with loci commonly used to differentiate populations in your model system. Then you can build gene trees, or trees with variants from these equal-depth regions. You could even pool a few of these equal-depth loci for more accuracy, at the cost of more computation.
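The locus selection described above could be sketched roughly as follows, assuming per-sample depth has already been summarised per candidate locus (for example from the alignments). The locus coordinates, the `max_fold` and `min_depth` thresholds, and the function name are hypothetical placeholders to tune against your own depth distributions.

```python
from statistics import median

# depth maps a locus label to per-sample mean depth, in the same order as `platform`.
# Coordinates, thresholds, and names are hypothetical.

def similar_depth_loci(depth, platform, max_fold=2.0, min_depth=5.0):
    """Keep loci whose per-platform median depths are all >= min_depth
    and differ from each other by at most max_fold."""
    labels = sorted(set(platform))
    keep = []
    for locus, values in depth.items():
        medians = [
            median(v for v, p in zip(values, platform) if p == label)
            for label in labels
        ]
        if min(medians) < min_depth:
            continue  # too shallow in at least one platform
        if max(medians) / min(medians) <= max_fold:
            keep.append(locus)
    return keep


platform = ["WGS", "WGS", "GBS", "GBS"]
depth = {
    "chr1:1000-1200": [14, 15, 20, 18],   # comparable depth in both platforms
    "chr1:5000-5200": [13, 16, 0, 0],     # not captured by GBS
    "chr2:300-500": [15, 14, 90, 110],    # GBS depth far above WGS
}
selected = similar_depth_loci(depth, platform)
print(selected)
```

Using the median rather than the mean keeps a few very deep or failed samples from deciding whether a locus passes.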

You could also visualise your candidate variant groups in a PCA before generating a phylogeny to see whether samples cluster in the expected ways. Be aware that this step can bias your inference: if the PCA captures something unexpected, it is tempting to discard variant sets simply because they don't match your expectations.
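For a feel of what such a PCA check involves, here is a toy, dependency-free sketch that extracts the first principal component of a small genotype matrix by power iteration. It assumes genotypes coded as alt-allele counts with None for missing calls, and it mean-imputes missing values, which is a simplifying assumption; for real data a dedicated PCA tool would be the sensible choice. All names are illustrative.

```python
import math

# Rows are samples, columns are sites; values are alt-allele counts or None.
# Missing calls are mean-imputed, which is a simplifying assumption.

def center_and_impute(samples):
    """Subtract each site's mean; missing calls become 0 (the site mean)."""
    n_sites = len(samples[0])
    out = [[0.0] * n_sites for _ in samples]
    for s in range(n_sites):
        called = [row[s] for row in samples if row[s] is not None]
        site_mean = sum(called) / len(called) if called else 0.0
        for i, row in enumerate(samples):
            if row[s] is not None:
                out[i][s] = row[s] - site_mean
    return out


def first_pc(samples, iterations=200):
    """Leading eigenvector of the sample-by-sample Gram matrix (PC1 direction)."""
    x = center_and_impute(samples)
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return [0.0] * n  # no variation left to explain
        v = [c / norm for c in w]
    return v


samples = [
    [0, 0, 1, 2],
    [0, 1, 1, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
]
pc1 = first_pc(samples)
print(pc1)
```

If PC1 separates samples cleanly by sequencing platform rather than by known origin or population, that suggests the variant set is still carrying a technical signal.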

Hi dthorbur,

Thanks so much for your reply! The purpose of the phylogeny is a little convoluted, but it is not for a study with a deep evolutionary focus. Essentially, the population with GBS data is a USDA gene bank collection that is readily and globally accessible to researchers, while the WGS data represents a collection that is not accessible outside its host nation. I wish to find a set of USDA accessions that represent the major clades found within the WGS accessions. We intend to construct a pangenome resource based on these target accessions.
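Framed that way, the task is less about a full tree and more about finding, for each WGS clade, the closest accessible accession. A minimal sketch of that idea follows, assuming clade labels for the WGS accessions are already available (from a tree or clustering) and that all genotype vectors cover the same shared sites. The accession names, clade labels, and distance measure are hypothetical.

```python
# Genotype vectors are alt-allele counts (0/1/2) with None for missing calls.
# Accession names, clade labels, and the distance measure are hypothetical.

def distance(a, b):
    """Mean absolute genotype difference over sites called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / len(shared)


def representatives(wgs, clade_of, gbs):
    """For each WGS clade, return (GBS accession, mean distance) for the
    GBS accession with the smallest mean distance to the clade's members."""
    result = {}
    for clade in sorted(set(clade_of.values())):
        members = [wgs[name] for name in wgs if clade_of[name] == clade]
        best = None
        for name, genotype in gbs.items():
            ds = [d for d in (distance(genotype, m) for m in members)
                  if d is not None]
            if not ds:
                continue  # no overlapping calls with this clade
            score = sum(ds) / len(ds)
            if best is None or score < best[1]:
                best = (name, score)
        result[clade] = best
    return result


wgs = {"W1": [0, 0, 1, 2], "W2": [0, 1, 1, 2], "W3": [2, 2, 0, 0]}
clade_of = {"W1": "A", "W2": "A", "W3": "B"}
gbs = {
    "PI_001": [0, 0, 1, None],
    "PI_002": [2, 2, 0, 1],
    "PI_003": [1, 1, 1, 1],
}
picks = representatives(wgs, clade_of, gbs)
print(picks)
```

Any platform bias in the shared sites will carry through to these distances, so this only makes sense after the sites have been filtered for comparable depth and call rate.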

Your idea of selecting a group of loci with similar depth distributions sounds good; that is definitely something I will try. What order of magnitude would you suggest for a 'handful of loci'? A hundred per chromosome? A thousand? Unfortunately my study organism doesn't have established sets of loci used for population analyses, so I don't think I can follow up on that angle.

If you have any further advice I'd really appreciate it!

Thanks again, Max

