Question

Fast way to extract part of genotypes data from 1000 Genome VCF files

3

Entering edit mode

5.7 years ago

Shicheng Guo ★ 9.4k

Hi All,

I need to do some permutation test to random sampling millions of SNPs set genotyping data from 1000 genome phase III dataset. each SNP-set only contains ~300 SNPs, however, I found the extract process is quite slow. for example, for the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf, to extract these 300 SNPs require almost 20 mins.

plink --vcf ~/hpc/db/hg19/1000Genome/chr6.uni.vcf --extract chr6.rs6457620.txt --r inter-chr dprime --out rs6457620

Is there any fast way to extract the subset quickly?

real 17m52.822s user 8m39.925s sys 1m20.681s

Done! Thanks.

tabix  ~/hpc/db/hg19/1000Genome/chr6.uni.vcf.gz  -R chr6.rs6457620.txt  > output.vcf

100 times faster!!!

Thanks.

fast vcf 1000Genome • 4.8k views

ADD COMMENT • link updated 5.7 years ago by chrchang523 10k • written 5.7 years ago by Shicheng Guo ★ 9.4k

1

Entering edit mode

Check @Kevin's post here:
Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

ADD REPLY • link 5.7 years ago by GenoMax 141k

1

Entering edit mode

Even tabix is slow, though. If you want quick access and manipulation of VCF-formatted data, then you should convert your data to BCF (Binary Call Format). BCFtools view would have been another option for you and could do the filtering in [likely] under 1 minute. I regularly filter / look up the entire 1000 Genomes Phase III data (1.3 gigabytes in BCF format).

BCFtools is rapid because it utilises htslib

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

3

Entering edit mode

Even tabix is slow, though.

I take this to make my first comparison between tabix and compressed bcf. For this I've made a bed file out of 30000 randomly selected SNP from the 1000 Genomes file.

Here's the result:

$ time (tabix -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > /dev/null)
( tabix -R snp_30000.bed  > /dev/null; )  26,14s user 0,68s system 82% cpu 32,342 total
$ time (bcftools view -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null)                                                 
( bcftools view -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz  > /dev/null; )  19,30s user 0,68s system 75% cpu 26,340 total
$ time (bcftools view -Ou -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null)
( bcftools view -Ou -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null; )  16,97s user 0,21s system 99% cpu 17,190 total

Nice :)

BCFtools is rapid because it utilises htslib

I'm not sure if this is the reason. I thought tabix utilises htslib as well.

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

1

Entering edit mode

Interesting!

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

1

Entering edit mode

Great try. Thanks fin swimmer!

ADD REPLY • link 5.6 years ago by Shicheng Guo ★ 9.4k

0

Entering edit mode

Great suggestion! Is there any suggestion on how to merge chromosome-splitted VCF to one big BCF?

ADD REPLY • link 5.5 years ago by Shicheng Guo ★ 9.4k

1

Entering edit mode

If the samples are the same in all VCFs, you can do this with cat/grep/etc.

There is a pre-merged 1000 Genomes phase 3 dataset in plink2 format, with annotations included, at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 (follow the "Quick start" instructions). You can export a merged VCF from this. (You can also use it directly; almost everything other than position-based lookup of a tiny number of SNPs is much faster with plink2 than bcftools.)

ADD REPLY • link 5.5 years ago by chrchang523 10k

0

Entering edit mode

Thanks Chris. I just find I need transfer between vcftools and plink again and again since Plink will remove the 'phase' status of the data. Is there any way to keep the phased genotypes in plink format?

ADD REPLY • link 5.5 years ago by Shicheng Guo ★ 9.4k

2

Entering edit mode

Stick to plink2 --pfile/--make-pgen when you need to preserve phase.

ADD REPLY • link 5.5 years ago by chrchang523 10k

4

Entering edit mode

5.7 years ago

chrchang523 10k

If you are using plink to operate on the same VCF file multiple times, you should convert it to plink-/plink2-format once and then only work with the converted data. This will be much, much faster than repeating the VCF conversion every time.

Also, plink 2.0's VCF converter is both a lot more powerful (multiallelic variants, dosage, phase, QUAL/FILTER/INFO all supported) and a lot faster than plink 1.9's.

ADD COMMENT • link 5.7 years ago by chrchang523 10k

1

Entering edit mode

Great suggestion. Thanks.Chris!! It will try plink 2.0 more

ADD REPLY • link 5.7 years ago by Shicheng Guo ★ 9.4k

3

Entering edit mode

5.7 years ago

Shicheng Guo ★ 9.4k

With the help of tabix, just few minutes rather than days to complete the LD

use strict;
use Cwd;
my $dir=getcwd;
chdir $dir;
my $genome="/gpfs/home/guosa/hpc/db/hg19/1000Genome";
mkdir "Shuffle" if ! -e "Shuffle";
my $file="/gpfs/home/guosa/nc/GWAS-378.DMER.bed.sort.bed.distance.bed";

open F,$file || die "cannot open $file\n";
my %snp;
while(<F>){
my ($chr,$start1,$end1,$rs,undef,$start2,$end2,undef,undef,$epi)=split/\s+/;
my(undef,$newchr)=split/chr/,$chr;
my($oldfrag,$newfrag,$tmp);
if($start1<=$start2){
$oldfrag="$newchr:$start1-$end2";
$tmp=2*$start1-$start2;
$newfrag="$newchr:$tmp-$start1";
}else{
$oldfrag="$newchr:$start2-$end1";
$tmp=2*$end1-$end2;
$newfrag="$newchr:$start1-$tmp";
}
open OUT,">./Shuffle/$chr.$rs.job";
print OUT "#PBS -N $rs\n";
print OUT "#PBS -o ./Shuffle/temp/\n";
print OUT "#PBS -e ./Shuffle/temp/\n";
print OUT "cd $dir/Shuffle\n";
print OUT "tabix -h $genome/$chr.vcf.gz $oldfrag > $chr.$rs.1.vcf\n";
print OUT "tabix -h $genome/$chr.vcf.gz $newfrag > $chr.$rs.2.vcf\n";
print OUT "plink --vcf $chr.$rs.1.vcf --r inter-chr dprime --out $chr.$rs.1\n";
print OUT "plink --vcf $chr.$rs.2.vcf --r inter-chr dprime --out $chr.$rs.2\n";
close OUT;
system("qsub ./Shuffle/$chr.$rs.job")
}

ADD COMMENT • link 5.7 years ago by Shicheng Guo ★ 9.4k

2

Entering edit mode

Why did you post an answer if ATpoint has already provided one? You even answered your own question by editing your original question.

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

1

Entering edit mode

I think OP provided an answer, based on my suggestion with Tabix. I like OPs who follow up their posts to help others. Keep it up @Shicheng Guo =)

ADD REPLY • link 5.7 years ago by ATpoint 81k

1

Entering edit mode

I just see a confusing PBS script with little relation to the original problem.

¯_(ツ)_/¯

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

1

Entering edit mode

it will be perfect if tabix could accept rs number as the input.

ADD REPLY • link 5.7 years ago by Shicheng Guo ★ 9.4k

0

Entering edit mode

See A: tabix for ID column

ADD REPLY • link 5.7 years ago by finswimmer 16k

1

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

1

Entering edit mode

There is a lot in this script that I would discourage. For example, do not change working directories within a script, it is unexpected behavior. Also the part where the filenames are hard coded - these are things you do not want to be doing in scripts.

ADD REPLY • link 5.7 years ago by Ram 43k

score 8 · Accepted Answer · 2018-08-23

8

Entering edit mode

5.7 years ago

ATpoint 81k

Sounds like a job for Tabix. Have a look at the manual and the -R option. It allows quick retrieval of regions from a tabix-indexed VCF, GFF/GTF/BED file.

ADD COMMENT • link 5.7 years ago by ATpoint 81k