Question

gVCF files from 1000 Genomes samples

8

7.9 years ago

donfreed ★ 1.6k

We are hoping to use 1000 Genomes samples as a population control for our study. The 1000 Genomes Project provides FASTQ, BAM and VCF files on its FTP site. We do not want to use the VCF files because they have been filtered and might not contain variants that occur in our samples (especially variants that are false positives in our samples). Using dbSNP is problematic for the same reason.

So it seems like a good alternative is to use 1000 Genomes BAM files. However, it would save us compute time if we could use gVCF files. Does anyone know if gVCF files from 1000 Genomes Project samples are publicly available?

gvcf 1000 Genomes • 3.9k views

3.5 years ago by donfreed ★ 1.6k

5

7.9 years ago

QVINTVS_FABIVS_MAXIMVS ★ 2.5k

Here's all the info on VCF for 1000 Genomes. Unfortunately, I do not think they have gVCF files. It sounds like you might have to resort to the BAM files.

I needed biallelic depths of coverage from 1000 Genomes, but they do not report those (even though the genomes are phased). Luckily, my training set consisted of 27 high-coverage samples from 1000 Genomes, so I ran HaplotypeCaller on all 27 BAM files, which took about 12 hours once the work was split up by chromosome.
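For anyone wanting to reproduce the per-chromosome split described above, here is a minimal sketch that only builds the command lines. It assumes GATK4 syntax (`gatk HaplotypeCaller` with `-ERC GVCF`), chromosome names without a `chr` prefix, and a sample ID taken from the start of the BAM file name; the thread does not say which GATK version or options were actually used, and the function name and paths are hypothetical.

```python
from pathlib import Path

# Assumption: contig names without a "chr" prefix (GRCh37-style references).
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def haplotypecaller_commands(bam, reference, out_dir, chromosomes=CHROMOSOMES):
    """Build one GATK4 HaplotypeCaller command per chromosome for a single BAM.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    # Assumption: the sample ID is the part of the file name before the first dot.
    sample = Path(bam).name.split(".")[0]
    commands = []
    for chrom in chromosomes:
        out = Path(out_dir) / f"{sample}.{chrom}.g.vcf.gz"
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", str(reference),   # reference FASTA
            "-I", str(bam),         # input alignments
            "-L", chrom,            # restrict calling to one chromosome
            "-ERC", "GVCF",         # emit reference confidence, i.e. a gVCF
            "-O", out.as_posix(),   # per-chromosome output
        ])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` or written out as one job per line for a cluster scheduler, so the chromosomes run in parallel.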

If your experiment was done with high-coverage libraries, then a proper control would be the high-coverage genomes. There are not too many of those, so it may be less work than you think.

2

Thanks for the info. Our study focuses on rare variation, so sample breadth (more samples) matters more to us than having the variant properties of the control population exactly match those of our samples. That is why we are analyzing the low-coverage samples. We do not have to analyze all of them, but more is better.

7.9 years ago by donfreed ★ 1.6k

0

Right on. For me it was sensitivity, not breadth. I've worked with low-coverage samples too. If you're familiar with AWS, 1000 Genomes has a public S3 bucket, so transfers within AWS should be free (I think). I wrote a grant proposal for AWS, which was a single paragraph, and they gifted me a lot of credits. I pulled out features from 2,504 low-coverage BAM files in about a week or two on AWS. Good luck!
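As a rough illustration of reading from the public bucket mentioned above, the sketch below builds AWS CLI commands for anonymous access. The bucket name `1000genomes` and the `phase3/data/<sample>/alignment/` prefix used in the usage note are assumptions based on the public listing and should be checked before relying on them; the helper names are hypothetical. `--no-sign-request` is the AWS CLI flag for unauthenticated requests.

```python
# Assumption: the public 1000 Genomes bucket is named "1000genomes".
BUCKET = "1000genomes"


def s3_uri(key, bucket=BUCKET):
    """Return the s3:// URI for an object key or prefix in the bucket."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def list_command(prefix):
    """AWS CLI command that lists objects under a prefix without credentials."""
    return ["aws", "s3", "ls", "--no-sign-request", s3_uri(prefix)]


def download_command(key, dest):
    """AWS CLI command that copies one object to a local path without credentials."""
    return ["aws", "s3", "cp", "--no-sign-request", s3_uri(key), dest]
```

For example, `list_command("phase3/data/HG00096/alignment/")` gives a command for listing one sample's alignment files, whose exact names can then be passed to `download_command`. Running the commands from an instance in the same AWS region as the bucket should keep the transfer inside AWS.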

7.9 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.5k

score 1 · Accepted Answer · 2020-10-15

1

3.5 years ago

donfreed ★ 1.6k

A few years later, gVCF files from the 1000 Genomes Project are now publicly available on Figshare.
