Question: gVCF files from 1000 Genomes samples
donfreed (1.5k), San Francisco, wrote 4.5 years ago:

We are hoping to use 1000 Genomes samples as a population control for our study. The 1000 Genomes Project provides FASTQ, BAM and VCF files on its FTP site. We do not want to use the VCF files because they have been filtered and might not contain variants that occur in our samples (especially false-positive variants in our samples). Using dbSNP is problematic for the same reason.

So a good alternative seems to be the 1000 Genomes BAM files. However, it would save us compute time if we could use gVCF files instead. Does anyone know if gVCF files from 1000 Genomes Project samples are publicly available?

gvcf 1000 genomes • 2.0k views
modified 5 weeks ago • written 4.5 years ago by donfreed (1.5k)
donfreed (1.5k), San Francisco, wrote 5 weeks ago:

A few years later, gVCF files from the 1000 Genomes Project are now publicly available on Figshare.

written 5 weeks ago by donfreed (1.5k)
QVINTVS_FABIVS_MAXIMVS (2.4k), USA SoCal, wrote 4.5 years ago:

Here's all the info on VCF for 1000 Genomes. Unfortunately, I do not think they have a gVCF file. Sounds like you might have to resort to the BAM files.

I needed biallelic depth of coverage from 1000 Genomes, but they do not report that (even though the genomes are phased). Luckily, my training set consisted of 27 high-coverage samples from 1000 Genomes. So I ran HaplotypeCaller on all 27 BAM files, which took about 12 hours (with the work broken up by chromosome).

If your experiment was done with high-coverage libraries, then a proper control would be the high-coverage genomes. There are not too many of those, so it may be less work than you think.
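
As a rough illustration of breaking a HaplotypeCaller run up by chromosome, here is a minimal Python sketch that only builds the command lines, one per chromosome, without running them. It assumes GATK4-style syntax (`gatk HaplotypeCaller` with `-ERC GVCF` to emit gVCFs) and GRCh37-style contig names; the original run may well have used a different GATK version and different options. The function name and the file names `NA12878.bam`, `reference.fasta` and `gvcf` are placeholders, not anything from this thread.

```python
from pathlib import Path

# Assumption: GRCh37-style contig names (1..22, X, Y); adjust to match your reference.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def haplotypecaller_commands(bam, reference, out_dir):
    """Build one GATK4 HaplotypeCaller command per chromosome for a single BAM."""
    # Use the part of the file name before the first dot as the sample label.
    sample = Path(bam).name.split(".")[0]
    commands = []
    for chrom in CHROMOSOMES:
        out = Path(out_dir) / f"{sample}.{chrom}.g.vcf.gz"
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", str(reference),   # reference FASTA
            "-I", str(bam),         # input BAM
            "-L", chrom,            # restrict calling to one chromosome
            "-ERC", "GVCF",         # emit a gVCF rather than a plain VCF
            "-O", str(out),
        ])
    return commands


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    for cmd in haplotypecaller_commands("NA12878.bam", "reference.fasta", "gvcf"):
        print(" ".join(cmd))
```

Each printed command is independent, so the per-chromosome jobs can be handed to whatever scheduler or parallel runner is available.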

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by QVINTVS_FABIVS_MAXIMVS2.4k

Thanks for the info. Our study is focused on rare variation, so in our case sample breadth (more samples) is more important than having the properties of variants in our samples exactly match our control population, which is why we are analyzing the low-coverage samples. We do not have to analyze all of the low-coverage samples, but more is better.

modified 4.5 years ago • written 4.5 years ago by donfreed (1.5k)

Right on. For me it was sensitivity and not breadth. I've worked with low-coverage samples too. If you're familiar with AWS, 1000 Genomes has a public S3 bucket, so transfers should be free (I think) on AWS. I wrote a grant proposal, which was a single paragraph, for AWS and they gifted me a lot of credits. I pulled out features from 2,504 low-coverage BAM files in about a week or two on AWS. Good luck!
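
For anyone trying the S3 route, a small Python sketch of building an anonymous download command with the AWS CLI might look like this. It assumes the public bucket is the AWS Open Data bucket named `1000genomes` and that the `aws` CLI is installed; the object key shown is a made-up placeholder, so browse the bucket for the real paths.

```python
# Assumption: the public bucket is the AWS Open Data bucket named "1000genomes".
BUCKET = "1000genomes"


def s3_copy_command(key, dest_dir="."):
    """Build an anonymous `aws s3 cp` command for one object in the bucket."""
    return [
        "aws", "s3", "cp",
        "--no-sign-request",          # public bucket, so no AWS credentials are needed
        f"s3://{BUCKET}/{key}",
        dest_dir,
    ]


if __name__ == "__main__":
    # Hypothetical key, for illustration only.
    print(" ".join(s3_copy_command("path/to/sample.bam")))
```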

modified 4.5 years ago • written 4.5 years ago by QVINTVS_FABIVS_MAXIMVS (2.4k)