Question: Fast way to extract part of genotypes data from 1000 Genome VCF files
3
gravatar for Shicheng Guo
10 months ago by
Shicheng Guo7.5k
Shicheng Guo7.5k wrote:

Hi All,

I need to do some permutation test to random sampling millions of SNPs set genotyping data from 1000 genome phase III dataset. each SNP-set only contains ~300 SNPs, however, I found the extract process is quite slow. for example, for the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf, to extract these 300 SNPs require almost 20 mins.

plink --vcf ~/hpc/db/hg19/1000Genome/chr6.uni.vcf --extract chr6.rs6457620.txt --r inter-chr dprime --out rs6457620

Is there any fast way to extract the subset quickly?

real 17m52.822s user 8m39.925s sys 1m20.681s

Done! Thanks.

tabix  ~/hpc/db/hg19/1000Genome/chr6.uni.vcf.gz  -R chr6.rs6457620.txt  > output.vcf

100 times faster!!!

Thanks.

1000genome vcf fast • 1.1k views
ADD COMMENTlink modified 10 months ago by chrchang5235.3k • written 10 months ago by Shicheng Guo7.5k
1

Check @Kevin's post here:
Produce PCA bi-plot for 1000 Genomes Phase III in VCF format

ADD REPLYlink modified 10 months ago • written 10 months ago by genomax68k
1

Even tabix is slow, though. If you want quick access and manipulation of VCF-formatted data, then you should convert your data to BCF (Binary Call Format). BCFtools view would have been another option for you and could do the filtering in [likely] under 1 minute. I regularly filter / look up the entire 1000 Genomes Phase III data (1.3 gigabytes in BCF format).

BCFtools is rapid because it utilises htslib

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe44k
3

Even tabix is slow, though.

I take this to make my first comparison between tabix and compressed bcf. For this I've made a bed file out of 30000 randomly selected SNP from the 1000 Genomes file.

Here's the result:

$ time (tabix -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > /dev/null)
( tabix -R snp_30000.bed  > /dev/null; )  26,14s user 0,68s system 82% cpu 32,342 total
$ time (bcftools view -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null)                                                 
( bcftools view -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz  > /dev/null; )  19,30s user 0,68s system 75% cpu 26,340 total
$ time (bcftools view -Ou -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null)
( bcftools view -Ou -R snp_30000.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.bcf.gz > /dev/null; )  16,97s user 0,21s system 99% cpu 17,190 total

Nice :)

BCFtools is rapid because it utilises htslib

I'm not sure if this is the reason. I thought tabix utilises htslib as well.

fin swimmer

ADD REPLYlink modified 10 months ago • written 10 months ago by finswimmer11k
1

Interesting!

ADD REPLYlink written 10 months ago by Kevin Blighe44k
1

Great try. Thanks fin swimmer!

ADD REPLYlink written 10 months ago by Shicheng Guo7.5k

Great suggestion! Is there any suggestion on how to merge chromosome-splitted VCF to one big BCF?

ADD REPLYlink written 7 months ago by Shicheng Guo7.5k
1

If the samples are the same in all VCFs, you can do this with cat/grep/etc.

There is a pre-merged 1000 Genomes phase 3 dataset in plink2 format, with annotations included, at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 (follow the "Quick start" instructions). You can export a merged VCF from this. (You can also use it directly; almost everything other than position-based lookup of a tiny number of SNPs is much faster with plink2 than bcftools.)

ADD REPLYlink written 7 months ago by chrchang5235.3k

Thanks Chris. I just find I need transfer between vcftools and plink again and again since Plink will remove the 'phase' status of the data. Is there any way to keep the phased genotypes in plink format?

ADD REPLYlink written 7 months ago by Shicheng Guo7.5k
2

Stick to plink2 --pfile/--make-pgen when you need to preserve phase.

ADD REPLYlink written 7 months ago by chrchang5235.3k
7
gravatar for ATpoint
10 months ago by
ATpoint18k
Germany
ATpoint18k wrote:

Sounds like a job for Tabix. Have a look at the manual and the -R option. It allows quick retrieval of regions from a tabix-indexed VCF, GFF/GTF/BED file.

ADD COMMENTlink written 10 months ago by ATpoint18k
4
gravatar for chrchang523
10 months ago by
chrchang5235.3k
United States
chrchang5235.3k wrote:

If you are using plink to operate on the same VCF file multiple times, you should convert it to plink-/plink2-format once and then only work with the converted data. This will be much, much faster than repeating the VCF conversion every time.

Also, plink 2.0's VCF converter is both a lot more powerful (multiallelic variants, dosage, phase, QUAL/FILTER/INFO all supported) and a lot faster than plink 1.9's.

ADD COMMENTlink written 10 months ago by chrchang5235.3k
1

Great suggestion. Thanks.Chris!! It will try plink 2.0 more

ADD REPLYlink modified 10 months ago • written 10 months ago by Shicheng Guo7.5k
3
gravatar for Shicheng Guo
10 months ago by
Shicheng Guo7.5k
Shicheng Guo7.5k wrote:

With the help of tabix, just few minutes rather than days to complete the LD

use strict;
use Cwd;
my $dir=getcwd;
chdir $dir;
my $genome="/gpfs/home/guosa/hpc/db/hg19/1000Genome";
mkdir "Shuffle" if ! -e "Shuffle";
my $file="/gpfs/home/guosa/nc/GWAS-378.DMER.bed.sort.bed.distance.bed";

open F,$file || die "cannot open $file\n";
my %snp;
while(<F>){
my ($chr,$start1,$end1,$rs,undef,$start2,$end2,undef,undef,$epi)=split/\s+/;
my(undef,$newchr)=split/chr/,$chr;
my($oldfrag,$newfrag,$tmp);
if($start1<=$start2){
$oldfrag="$newchr:$start1-$end2";
$tmp=2*$start1-$start2;
$newfrag="$newchr:$tmp-$start1";
}else{
$oldfrag="$newchr:$start2-$end1";
$tmp=2*$end1-$end2;
$newfrag="$newchr:$start1-$tmp";
}
open OUT,">./Shuffle/$chr.$rs.job";
print OUT "#PBS -N $rs\n";
print OUT "#PBS -o ./Shuffle/temp/\n";
print OUT "#PBS -e ./Shuffle/temp/\n";
print OUT "cd $dir/Shuffle\n";
print OUT "tabix -h $genome/$chr.vcf.gz $oldfrag > $chr.$rs.1.vcf\n";
print OUT "tabix -h $genome/$chr.vcf.gz $newfrag > $chr.$rs.2.vcf\n";
print OUT "plink --vcf $chr.$rs.1.vcf --r inter-chr dprime --out $chr.$rs.1\n";
print OUT "plink --vcf $chr.$rs.2.vcf --r inter-chr dprime --out $chr.$rs.2\n";
close OUT;
system("qsub ./Shuffle/$chr.$rs.job")
}
ADD COMMENTlink written 10 months ago by Shicheng Guo7.5k
2

Why did you post an answer if ATpoint has already provided one? You even answered your own question by editing your original question.

ADD REPLYlink written 10 months ago by Kevin Blighe44k
1

I think OP provided an answer, based on my suggestion with Tabix. I like OPs who follow up their posts to help others. Keep it up @Shicheng Guo =)

ADD REPLYlink written 10 months ago by ATpoint18k
1

I just see a confusing PBS script with little relation to the original problem.

¯_(ツ)_/¯

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe44k
1

it will be perfect if tabix could accept rs number as the input.

ADD REPLYlink written 10 months ago by Shicheng Guo7.5k

See A: tabix for ID column

ADD REPLYlink written 10 months ago by finswimmer11k
1

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLYlink written 10 months ago by Kevin Blighe44k
1

There is a lot in this script that I would discourage. For example, do not change working directories within a script, it is unexpected behavior. Also the part where the filenames are hard coded - these are things you do not want to be doing in scripts.

ADD REPLYlink written 10 months ago by RamRS22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2109 users visited in the last hour