Question: Extracting Variants For Select Individuals Overlapping Pre-Specified Regions From 1000 Genomes
6.8 years ago by
Ryan D3.3k
USA
Ryan D3.3k wrote:

We have a list of a few thousand CNVs/SVs identified using array-based CNV calling methods. They are from five individuals sequenced by the 1000 Genomes Project. We would like to compare the breakpoints of the CNVs identified by our calling with those identified by 1000 Genomes.

So the question: how can I generate a VCF file (presumably using tabix and vcftools) for a select set of individuals, restricted to variants overlapping a list of specified CNVs like the ones below? Moreover, can I do this (thousands of regions) in a single step? Should the regions I query be smaller or larger than the CNVs we identified? And can we filter the results to those above a threshold size?

Note: it appears BrentP has an answer about how to get multiple regions at once from 1kG here, but if I am pulling all of these from the 1kG FTP server, must I do it by chromosome? And how would pulling regions handle imperfect overlaps as mentioned above?

Individuals:
NA10851
NA18505

Regions:
chr22:22680529-22726814
chr22:22613016-22670785
chr22:41234550-41276824

Thanks, Ryan

1000genomes tabix vcftools • 2.9k views
ADD COMMENTlink modified 6.6 years ago by Laura1.7k • written 6.8 years ago by Ryan D3.3k

A little tip: you can get better answers if you only ask one question per topic. Otherwise, it's difficult for people to reply, and you will only get general answers.

ADD REPLYlink written 6.8 years ago by Giovanni M Dall'Olio26k
6.8 years ago by
Manhattan, NY
Khader Shameer18k wrote:

I will try to answer the second part of your question.

The manuscript that describes 1000 Genomes data management and community access provides extensive details on accessing 1000 Genomes data.
You can iteratively access data using Samtools, Data Slicer or Tabix; refer to the 1000 Genomes site for a detailed tutorial. AFAIK, the tabix files are indexed by chromosome, so you can pull out data as you need from the 1000 Genomes FTP. See the section on "How do I get a slice of your vcf files".

ADD COMMENTlink written 6.8 years ago by Khader Shameer18k
6.8 years ago by
Laura1.7k
Cambridge UK
Laura1.7k wrote:

Unfortunately there are no batch processing tools currently available for the 1000 Genomes data sets. We provide genotypes on a per-chromosome basis, as a single whole-genome file would simply be too big to provide.

Something like this would be fairly straightforward to script using tabix and Perl or a similar scripting language, though.

The basic tabix command is provided in the FAQ Khader linked to:

http://www.1000genomes.org/faq/how-do-i-get-sub-section-vcf-file

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.chr17_1471000-1472000.20110521.genotypes.vcf.gz

You need to write a script which iterates through your BED file and constructs commands like this, one per region.
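As a rough illustration of such a script, here is a minimal sketch in Python rather than Perl. It only builds the command strings and does not run them. The URL template is copied from the example command above and is assumed to hold for every chromosome in that release; the function name `bed_to_command` and the `pad` parameter (extra bases added on each side, one possible way to allow for imperfect breakpoint overlap) are hypothetical names introduced here, not part of any 1000 Genomes tool. It also assumes the input is standard BED (0-based start, end-exclusive) and that `vcf-subset -c` accepts a comma-separated sample list.

```python
# Assumed per-chromosome URL pattern, modeled on the chr17 example above.
URL_TEMPLATE = (
    "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/"
    "ALL.chr{chrom}.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)


def bed_to_command(bed_line, samples, pad=0):
    """Build one tabix | vcf-subset | bgzip pipeline for a single BED line.

    bed_line: whitespace-separated 'chrom start end' (BED, 0-based start).
    samples:  list of sample IDs to keep, e.g. ["NA10851", "NA18505"].
    pad:      hypothetical padding in bases added to both sides of the region.
    """
    chrom, start, end = bed_line.split()[:3]
    # The example command queries '17:...', so drop a leading 'chr' if present.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    # BED starts are 0-based; tabix regions are 1-based and inclusive.
    start1 = max(1, int(start) + 1 - pad)
    end1 = int(end) + pad
    region = f"{chrom}:{start1}-{end1}"
    out_name = (
        f"{'_'.join(samples)}.chr{chrom}_{start1}-{end1}"
        ".20110521.genotypes.vcf.gz"
    )
    return (
        f"tabix -h {URL_TEMPLATE.format(chrom=chrom)} {region}"
        f" | perl vcf-subset -c {','.join(samples)}"
        f" | bgzip -c > {out_name}"
    )


def commands_from_bed(lines, samples, pad=0):
    """Yield one command per non-empty, non-comment BED line."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(("#", "track", "browser")):
            yield bed_to_command(line, samples, pad)


if __name__ == "__main__":
    example_bed = [
        "chr22\t22680528\t22726814",
        "chr22\t22613015\t22670785",
    ]
    for cmd in commands_from_bed(example_bed, ["NA10851", "NA18505"]):
        print(cmd)
```

The printed lines can be reviewed and then pasted into a shell. Since each command picks its file from the region's chromosome, regions on different chromosomes are handled without any manual splitting.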

Our VCF filename convention is quite consistent; generally it is Pop.chr.description.YYYYMMDD. When the file is whole genome, the chromosome field is sometimes missing.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Laura1.7k