Shortening Whole-Genome Data for Benchmarking
5.2 years ago
bshifaw ▴ 50

I want to test new pre-processing bioinformatics workflows. I have a pair of whole-genome sequencing FASTQ files, ~100 GB each. Running this dataset takes a mighty long time to complete, and on a new workflow it takes a mighty long time to fail. Would it be fine to cut the 100 GB down to around 30 GB and run that through my new workflows, so I don't have to wait a week before a run fails due to a silly mistake?

sequence benchmarking

Why don't you first try just one chromosome, to see whether you get an error?

You could do that with an even smaller set (3 GB), as long as you randomly subsample and use the same hardware resources for the test. Hopefully the failures are independent of the dataset size; otherwise this test would not mean much.
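The key point above is *random* subsampling while keeping read pairs together. Not from the thread, but as a minimal Python sketch of the idea (function name and the fixed seed are illustrative; a dedicated subsampling tool will be much faster on 100 GB inputs): one random decision per pair, applied to both files, keeps R1 and R2 in sync.

```python
import random

def subsample_fastq_pair(r1_in, r2_in, r1_out, r2_out, fraction, seed=100):
    """Keep each read pair with probability `fraction`.

    One RNG decision per pair keeps R1 and R2 in sync, which
    paired-end workflows require. Assumes uncompressed FASTQ.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible subset
    with open(r1_in) as f1, open(r2_in) as f2, \
         open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]  # a FASTQ record is 4 lines
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0]:  # end of file
                break
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
```

With `fraction=0.3` this keeps roughly 30% of the pairs; rerunning with the same seed reproduces the same subset, which helps when comparing workflow runs.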

5.2 years ago

I want to know if it would be fine if i could cut 30gb from the 100gb and run it through my new workflows

It sounds like a good idea to test on a smaller subset; in my opinion you could cut even more.

For sampling, you can consider this tool. There are also some interesting ways to sample your reads in this discussion.
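One of the interesting sampling approaches worth knowing is reservoir sampling, which draws an exact number of reads uniformly in a single pass, without knowing the file size in advance. A minimal Python sketch (function name and seed are illustrative, not from the linked tool):

```python
import random

def reservoir_sample_fastq(path, n_reads, seed=100):
    """Single-pass uniform sample of exactly `n_reads` FASTQ records
    (or all of them if the file has fewer). Assumes uncompressed FASTQ."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as fh:
        seen = 0
        while True:
            rec = [fh.readline() for _ in range(4)]  # 4 lines per record
            if not rec[0]:  # end of file
                break
            if len(reservoir) < n_reads:
                reservoir.append(rec)  # fill the reservoir first
            else:
                # classic reservoir step: replace a slot with prob n_reads/(seen+1)
                j = rng.randrange(seen + 1)
                if j < n_reads:
                    reservoir[j] = rec
            seen += 1
    return reservoir
```

Note this holds the sampled records (not the whole file) in memory, so it suits targets like "give me exactly 10 million reads" rather than "keep 30% of the file".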
