Shortening Whole-Genome Data for Benchmarking
5.2 years ago
bshifaw ▴ 50

I want to test new pre-processing bioinformatics workflows. I have a pair of whole-genome sequencing FASTQ files, ~100 GB each. Running this dataset takes a mighty long time to complete, and on a new workflow it takes a mighty long time to fail. Would it be fine to cut the 100 GB down to around 30 GB and run that through my new workflows, so I don't have to wait a week before a run fails due to a silly mistake?

sequence benchmarking

Why don't you first try just one chromosome, to see whether you get an error?

You could do that with an even smaller set (3 GB), as long as you randomly subsample and use the same hardware resources for the test. Hopefully the failures are independent of the dataset size; otherwise this test would not mean much.
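The key point above is *random* subsampling while keeping read pairs together. Not from the thread, but as a minimal Python sketch of the idea (function name and the fixed seed are illustrative; a dedicated subsampling tool will be much faster on 100 GB inputs): one random decision per pair, applied to both files, keeps R1 and R2 in sync.

```python
import random

def subsample_fastq_pair(r1_in, r2_in, r1_out, r2_out, fraction, seed=100):
    """Keep each read pair with probability `fraction`.

    One RNG decision per pair keeps R1 and R2 in sync, which
    paired-end workflows require. Assumes uncompressed FASTQ.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible subset
    with open(r1_in) as f1, open(r2_in) as f2, \
         open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]  # a FASTQ record is 4 lines
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0]:  # end of file
                break
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
```

With `fraction=0.3` this keeps roughly 30% of the pairs; rerunning with the same seed reproduces the same subset, which helps when comparing workflow runs.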

5.2 years ago

I want to know if it would be fine if i could cut 30gb from the 100gb and run it through my new workflows

It sounds like a good idea to test on a smaller subset; in my opinion you could cut even more.

For sampling, you can consider this tool. There are also some interesting ways to sample your reads in this discussion.
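One of the interesting sampling approaches worth knowing is reservoir sampling, which draws an exact number of reads uniformly in a single pass, without knowing the file size in advance. A minimal Python sketch (function name and seed are illustrative, not from the linked tool):

```python
import random

def reservoir_sample_fastq(path, n_reads, seed=100):
    """Single-pass uniform sample of exactly `n_reads` FASTQ records
    (or all of them if the file has fewer). Assumes uncompressed FASTQ."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as fh:
        seen = 0
        while True:
            rec = [fh.readline() for _ in range(4)]  # 4 lines per record
            if not rec[0]:  # end of file
                break
            if len(reservoir) < n_reads:
                reservoir.append(rec)  # fill the reservoir first
            else:
                # classic reservoir step: replace a slot with prob n_reads/(seen+1)
                j = rng.randrange(seen + 1)
                if j < n_reads:
                    reservoir[j] = rec
            seen += 1
    return reservoir
```

Note this holds the sampled records (not the whole file) in memory, so it suits targets like "give me exactly 10 million reads" rather than "keep 30% of the file".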
