Question: Compare genotypes between WGS and targeted panel
2.8 years ago, mpinsach wrote:

Hi all,

I am trying to compare the genotypes between two human cohorts: one sequenced by whole genome sequencing (50X) and another sequenced using a custom panel (250X). I performed a t-distributed stochastic neighbor embedding (t-SNE) analysis, and the two populations separate cleanly into two distinct clusters.

I suspect the separation might be due to the use of different technologies (WGS and targeted sequencing).

The DNA was sequenced on an Illumina platform, the SNVs were called using GATK HaplotypeCaller, and the calls were recalibrated for both populations. However, the mean number of variants per sample is higher in the targeted-sequencing cohort.

I created a 0/1 matrix for absence/presence of a variant at each genomic position reported in the VCF files from the WGS and targeted cohorts, as shown in the example below:

                Sample1 Sample2 Sample3
chr3:37428076   0   1   0

I created a final SNV list by adding each cohort's specific coordinates to the other cohort, so that both cohorts have the same set of coordinates.
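
For readers who want to reproduce this kind of matrix, here is a minimal sketch in plain Python. It is not the poster's actual code: the function names (`genotype_has_alt`, `presence_matrix`, `combine`) and the toy VCF lines are made up for illustration, and it assumes sample names are unique across the two cohorts.

```python
def genotype_has_alt(gt):
    """Return True if a GT string such as '0/1' or '1|1' carries a non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def presence_matrix(vcf_lines):
    """Return (samples, {"chrom:pos": {sample: 0 or 1}}) for one VCF given as text lines."""
    samples = []
    matrix = {}
    for line in vcf_lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and empty lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        key = f"{fields[0]}:{fields[1]}"
        gt_index = fields[8].split(":").index("GT")
        row = matrix.setdefault(key, {s: 0 for s in samples})
        for sample, call in zip(samples, fields[9:]):
            if genotype_has_alt(call.split(":")[gt_index]):
                row[sample] = 1
    return samples, matrix


def combine(cohort_a, cohort_b):
    """Pad both cohorts to the union of positions, filling unseen positions with 0."""
    samples_a, matrix_a = cohort_a
    samples_b, matrix_b = cohort_b
    combined = {}
    for key in sorted(set(matrix_a) | set(matrix_b)):
        row = {s: 0 for s in samples_a + samples_b}
        row.update(matrix_a.get(key, {}))
        row.update(matrix_b.get(key, {}))
        combined[key] = row
    return combined


# Toy input (hypothetical records, tab-separated as in a real VCF).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
wgs_vcf = [
    "##fileformat=VCFv4.2",
    header + "\tSample1\tSample2",
    "chr3\t37428076\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:48\t0/1:52",
]
panel_vcf = [
    "##fileformat=VCFv4.2",
    header + "\tSample3",
    "chr3\t37428076\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:250",
    "chr7\t100\t.\tC\tT\t50\tPASS\t.\tGT\t1/1",
]
combined = combine(presence_matrix(wgs_vcf), presence_matrix(panel_vcf))
for position, row in combined.items():
    print(position, row)
```

Note that a 0 filled in this way cannot distinguish a homozygous-reference call from a position that was simply not covered or not reported in that cohort.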

Does anyone know how to perform this kind of comparison?

Thank you.

snp genome • 646 views

2.8 years ago, Kevin Blighe wrote:

You should:

  1. Filter the datasets so that only common variants are included
  2. Normalise the VCFs / BCFs (bcftools norm -m-any)
  3. Merge everything together
  4. Read the data into PLINK and check samples against 1000 Genomes (see Produce PCA bi-plot for 1000 Genomes Phase III in VCF format)
  5. Run the comparisons in PLINK (e.g. logistic regression)
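
As an illustration of step 1, here is a minimal Python sketch that keeps only the records present in both datasets. This reads "common" as "shared between the two datasets", which is an interpretation (it could also mean common by allele frequency). The function names and toy records are hypothetical, and the sketch assumes multiallelic sites have already been split so that each record has a single ALT allele. On real files a dedicated tool such as bcftools isec would normally do this job.

```python
def record_key(line):
    """Identify a VCF record by (chrom, pos, ref, alt)."""
    chrom, pos, _id, ref, alt = line.split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def variant_keys(vcf_lines):
    """Collect the keys of all records in a VCF given as text lines."""
    return {
        record_key(line)
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def keep_shared(vcf_lines, shared):
    """Return the header lines plus only the records whose key is in `shared`."""
    kept = []
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#") or record_key(line) in shared:
            kept.append(line)
    return kept


# Toy input (hypothetical records, tab-separated as in a real VCF).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
wgs_vcf = [
    header + "\tSample1",
    "chr1\t1000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
    "chr3\t37428076\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
]
panel_vcf = [
    header + "\tSample3",
    "chr3\t37428076\t.\tA\tG\t50\tPASS\t.\tGT\t1/1",
    "chr7\t100\t.\tC\tT\t50\tPASS\t.\tGT\t0/1",
]

shared = variant_keys(wgs_vcf) & variant_keys(panel_vcf)
wgs_shared = keep_shared(wgs_vcf, shared)
panel_shared = keep_shared(panel_vcf, shared)
print(len(wgs_shared) - 1, "shared record(s) kept per dataset")
```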

I do not know anything about sample numbers, disease state, or ethnicity, so I cannot provide specifics for tests.

Dear Kevin,

Thank you so much for your reply. I waited to write back until I had tried your suggestions myself.

I followed all your suggestions, as well as your post Produce PCA bi-plot for 1000 Genomes Phase III in VCF format [1], but I got stuck after pruning the variants from each 1000 Genomes chromosome. I also don't know how to merge my cohorts file with the 1000 Genomes data so that they can be compared in PLINK.

Regarding the sample specifics, the WGS cohort is composed of 200 healthy individuals, while the targeted sequencing cohort is composed of 91 individuals with cardiac disease. Both cohorts are Caucasian, although one comes from America and the other from Spain.

Thank you.

written 2.8 years ago by mpinsach

Would Spanish be considered Caucasian or Hispanic? The idea of merging with 1000 Genomes is specifically to gauge the influence of ethnicity in your cohort. Without correcting for ethnicity, you may find false associations.

You should, in that case, merge your 2 datasets together, and then merge with 1000 Genomes.
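
A rough sketch of that order of operations, written as a small Python helper that only builds and prints the command lines (it does not run them). Everything here is an assumption for illustration: the file names are hypothetical, the 1000 Genomes data is assumed to already exist as a PLINK bed/bim/fam set, and the linked tutorial may organise these steps differently. PLINK matches variants by ID when merging, so the variant IDs need to be consistent across datasets first.

```python
import shlex


def merge_plan(wgs_vcf, panel_vcf, kg_prefix, out_prefix):
    """Return the commands, in order: merge cohorts, convert, merge with 1000 Genomes, PCA."""
    cohorts_vcf = f"{out_prefix}.cohorts.vcf.gz"
    cohorts_bed = f"{out_prefix}.cohorts"
    merged = f"{out_prefix}.with1000g"
    return [
        # 1. Merge the two cohorts (inputs must be bgzipped and indexed).
        ["bcftools", "merge", "-Oz", "-o", cohorts_vcf, wgs_vcf, panel_vcf],
        # 2. Convert the merged cohorts to PLINK binary format.
        ["plink", "--vcf", cohorts_vcf, "--make-bed", "--out", cohorts_bed],
        # 3. Merge with the 1000 Genomes bed/bim/fam set.
        ["plink", "--bfile", cohorts_bed,
         "--bmerge", f"{kg_prefix}.bed", f"{kg_prefix}.bim", f"{kg_prefix}.fam",
         "--make-bed", "--out", merged],
        # 4. Principal component analysis on the combined data.
        ["plink", "--bfile", merged, "--pca", "--out", merged],
    ]


# Hypothetical file names, for illustration only.
plan = merge_plan("wgs.vcf.gz", "panel.vcf.gz", "1000G", "study")
for argv in plan:
    print(shlex.join(argv))
```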

Are you receiving any error message?

written 2.8 years ago by Kevin Blighe

The clinical information I received lists the Spanish individuals as being of Caucasian ethnicity.

For my two cohorts I did the following:

  1. Filtered the datasets so that only common variants are included. I did it with GATK, but I first had to remove multiallelic sites.
  2. Merged everything together with vcf-merge.

Then I followed the instructions from your post "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format", but I don't know at which step I should combine the 1000 Genomes data with my merged cohorts, or how I should do it.

written 2.8 years ago by mpinsach