Question

Any Benchmark Code Running Against 1000 Genomes Data

11.1 years ago

qiming.he ▴ 10

Hi,

I am looking for a suite of benchmark codes/scripts that take 1000 Genomes datasets (e.g., those published on http://aws.amazon.com/1000genomes s3://1000genomes) as input and do some non-trivial work (queries) on them. Since the purpose is benchmarking new parallel algorithms, I do not care exactly what work it performs, as long as it is 1) compute-intensive (from a computer scientist's perspective) and 2) meaningful (from a biologist's perspective). The benchmark I am looking for is more like LINPACK or TeraSort. For the sake of simplicity, it can use or chain off-the-shelf tools like samtools and vcftools. Can anyone point me (with little knowledge about DNA) in the right direction?

Thanks in advance

1000genomes • 1.7k views

updated 11.1 years ago by Giovanni M Dall'Olio 28k • written 11.1 years ago by qiming.he ▴ 10

score 2 · Answer 1 · 2013-03-20

You can simply use vcftools to calculate linkage disequilibrium (LD) statistics with the --hap-r2 option. It takes a lot of time: the last time I tried it, I had to stop it after a couple of days. Is this what you were looking for, or do you want something faster or slower?

The following command parses a VCF file, applies some filters (minor allele frequency, quality, etc.), and calculates the r² statistic between pairs of SNPs.

./bin/vcftools --remove-filtered-all --remove-indels --phased --gzvcf data/vcf/chr${SGE_TASK_ID}.vcf.gz --recode --maf 0.01 --keep-INFO-all --out data/vcf_filtered/chr${SGE_TASK_ID} --hap-r2 --minDP 2 --minGQ 0.05
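
Since you mention having little background in genetics, here is a rough sketch of what a haplotype-based r² calculation does and why it is slow. This is not vcftools' actual implementation. It is only the textbook definition, r² = D² / (pA(1-pA) pB(1-pB)) with D = pAB - pA*pB, applied to every pair of sites. The function names (hap_r2, pairwise_r2) and the toy haplotypes are made up for illustration, and each site is assumed to be biallelic and phased, coded as 0/1 per haplotype.

```python
from itertools import combinations


def hap_r2(a, b):
    """Squared correlation (r^2) between two biallelic sites.

    a, b: equal-length sequences of 0/1 alleles, one entry per phased haplotype.
    Returns NaN if either site is monomorphic (r^2 is undefined there).
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two equal-length, non-empty haplotype vectors")
    p_a = sum(a) / n                               # frequency of allele 1 at site A
    p_b = sum(b) / n                               # frequency of allele 1 at site B
    p_ab = sum(x & y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b                           # linkage disequilibrium coefficient D
    return d * d / denom


def pairwise_r2(sites):
    """Yield (pos1, pos2, r2) for every pair of sites.

    sites: dict mapping position -> haplotype vector (hypothetical layout).
    """
    for (p1, h1), (p2, h2) in combinations(sorted(sites.items()), 2):
        yield p1, p2, hap_r2(h1, h2)


# Toy data: three sites, four haplotypes (invented values, not real 1000 Genomes data).
sites = {
    100: [0, 0, 1, 1],
    200: [0, 0, 1, 1],
    300: [0, 1, 0, 1],
}

for p1, p2, r2 in pairwise_r2(sites):
    print(p1, p2, round(r2, 3))
```

With n sites there are n(n-1)/2 pairs, so the work grows quadratically with the number of SNPs that survive the filters. That is the main reason a whole-chromosome run can take days, and also why it may suit a parallel benchmark: each pair is independent of the others.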