Any Benchmark Code Running Against 1000 Genomes Data
11.1 years ago
qiming.he ▴ 10

Hi,

I am looking for a suite of benchmark codes/scripts that take 1000 Genomes datasets (e.g., those published at http://aws.amazon.com/1000genomes, s3://1000genomes) as input and do some non-trivial work (queries). Since the purpose is benchmarking new parallel algorithms, I do not care exactly what work it performs, as long as it is 1) compute-intensive (from a computer scientist's perspective) and 2) meaningful (from a biologist's perspective). The benchmark I am looking for is more like LINPACK or TeraSort. For simplicity, it can use or chain off-the-shelf tools such as samtools and vcftools. Can anyone point me (with little knowledge of DNA) in the right direction?

Thanks in advance

1000genomes • 1.7k views
11.1 years ago

You can simply use vcftools to calculate LD (linkage disequilibrium) statistics with the --hap-r2 option. It takes a lot of time; the last time I tried it, I had to stop it after a couple of days. Is this what you were looking for, or do you want something faster or slower?

The following command parses a VCF file, applies some filters (minor allele frequency, quality, etc.), and calculates the R2 statistics between SNPs.

./bin/vcftools --remove-filtered-all --remove-indels --phased --gzvcf data/vcf/chr1.vcf.gz --recode --maf 0.01 --keep-INFO-all --out data/vcf_filtered/chr${SGE_TASK_ID} --hap-r2 --minDP 2 --minGQ 0.05
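
For readers without a genetics background, here is a rough sketch of the quantity that a haplotype-based r2 calculation is built on, and of why an all-pairs run gets expensive. It is not vcftools' actual implementation; the function names `hap_r2` and `all_pairs_r2` and the 0/1 allele encoding are assumptions made for illustration. Each site is a list with one entry per phased haplotype (0 = reference allele, 1 = alternate allele).

```python
from itertools import combinations


def hap_r2(site_a, site_b):
    """Squared correlation (r^2) between two biallelic sites.

    site_a and site_b are equal-length sequences of 0/1 alleles,
    one entry per phased haplotype. Returns None if either site
    is monomorphic, because r^2 is undefined in that case.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must be non-empty and of equal length")

    p_a = sum(site_a) / n                                 # frequency of allele 1 at site A
    p_b = sum(site_b) / n                                 # frequency of allele 1 at site B
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n  # frequency of the 1-1 haplotype

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None

    d = p_ab - p_a * p_b                                  # linkage disequilibrium coefficient D
    return d * d / denom


def all_pairs_r2(sites):
    """r^2 for every pair of sites: m sites give m*(m-1)/2 pairs."""
    return {
        (i, j): hap_r2(sites[i], sites[j])
        for i, j in combinations(range(len(sites)), 2)
    }


if __name__ == "__main__":
    # Hypothetical toy data: 3 sites, 4 haplotypes.
    toy_sites = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 0, 1],
    ]
    for pair, value in all_pairs_r2(toy_sites).items():
        print(pair, value)
```

The number of pairs grows quadratically with the number of sites, which is consistent with the long runtimes mentioned above and is what makes this kind of job a plausible compute-bound benchmark.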