Question: Shortening Whole genome for Benchmarking
3.5 years ago by
bshifaw50
United States
bshifaw50 wrote:

I want to test new preprocessing bioinformatics workflows. I have a pair of whole-genome sequencing FASTQ files, ~100 GB each. Running this data set takes a very long time to complete, and on a new workflow it takes just as long to fail. Would it be fine to cut 30 GB from the 100 GB and run that subset through my new workflows, so that I don't have to wait a week before a run fails due to a silly mistake?

benchmarking sequence • 882 views
modified 3.5 years ago by Carlo Yague4.6k • written 3.5 years ago by bshifaw50

Why not first try just one chromosome, to see whether you get an error?

written 3.5 years ago by Benn7.7k

You could do that with an even smaller set (3 G), as long as you subsample randomly and use the same hardware resources for the test. Hopefully the failures are independent of the dataset size; otherwise this test would not tell you much.

written 3.5 years ago by genomax71k
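
To make the "randomly subsample" suggestion concrete, here is a minimal sketch in Python, not something taken from the thread. It assumes plain 4-line FASTQ records (no wrapped sequence lines), that both files list mates in the same order, and that a `.gz` extension means gzip compression. The function names and the file names in the usage note are hypothetical. Each pair is kept with probability `fraction`, and one decision is made per pair so the two output files stay in sync.

```python
import gzip
import random
from itertools import islice


def open_maybe_gzip(path, mode):
    """Open a text file, using gzip when the path ends in .gz (assumption)."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def read_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction`; return pairs kept.

    A single random draw is made per pair, so mates are kept or dropped
    together. Note that zip() stops at the shorter file, so a length
    mismatch between the two inputs is not detected here.
    """
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    kept = 0
    with open_maybe_gzip(r1_in, "r") as in1, \
            open_maybe_gzip(r2_in, "r") as in2, \
            open_maybe_gzip(r1_out, "w") as out1, \
            open_maybe_gzip(r2_out, "w") as out2:
        for rec1, rec2 in zip(read_records(in1), read_records(in2)):
            if rng.random() < fraction:
                out1.writelines(rec1)
                out2.writelines(rec2)
                kept += 1
    return kept
```

For example, `subsample_pairs("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sub_R1.fastq.gz", "sub_R2.fastq.gz", 0.03)` would keep roughly 3% of the pairs. The input is streamed, so memory use stays small even for 100 GB files, though the script still has to read through both files once.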
3.5 years ago by
Carlo Yague4.6k
Belgium
Carlo Yague4.6k wrote:

I want to know if it would be fine if i could cut 30gb from the 100gb and run it through my new workflows

Testing on a smaller subset sounds like a good idea. In my opinion, you could cut even more.

For sampling, you can consider this tool. There are also some interesting ways to sample your reads in this discussion.

modified 3.5 years ago • written 3.5 years ago by Carlo Yague4.6k
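
If you would rather draw an exact number of read pairs than a fraction, reservoir sampling is one way to do it in a single pass without knowing the total read count in advance. The sketch below is only an illustration, not the tool or the discussion linked above. It assumes uncompressed, 4-line FASTQ records with mates in the same order in both files, and the function names are hypothetical. It holds the `n_pairs` selected pairs in memory, so it suits small test subsets rather than very large ones.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes unwrapped records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample_pairs(r1_path, r2_path, n_pairs, seed=0):
    """Return up to n_pairs uniformly sampled (mate1, mate2) record pairs.

    Reservoir sampling (Algorithm R): the first n_pairs pairs fill the
    reservoir; after that, pair number i (0-based) replaces a random slot
    with probability n_pairs / (i + 1).
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    with open(r1_path) as in1, open(r2_path) as in2:
        pairs = zip(fastq_records(in1), fastq_records(in2))
        for i, pair in enumerate(pairs):
            if i < n_pairs:
                reservoir.append(pair)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_pairs:
                    reservoir[j] = pair
    return reservoir


def write_pairs(pairs, r1_out, r2_out):
    """Write sampled pairs to two FASTQ files, keeping mates in sync."""
    with open(r1_out, "w") as out1, open(r2_out, "w") as out2:
        for rec1, rec2 in pairs:
            out1.writelines(rec1)
            out2.writelines(rec2)
```

The sampled pairs come back in reservoir order rather than file order, which should not matter for unaligned reads, but it is worth keeping in mind if a downstream step expects the original ordering.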