I want to test new preprocessing bioinformatics workflows. I have a pair of whole-genome sequencing FASTQ files, ~100 GB each. Running this dataset takes a mighty long time to complete, and on a new workflow it takes a mighty long time to fail. Would it be fine to take a 30 GB subset of the 100 GB and run that through my new workflows, so I don't have to wait a week before a run fails due to a silly mistake?
Why don't you first try just one chromosome, to see whether you get an error?
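If you have an indexed BAM from a previous successful run of the old workflow, one way to build such a subset is to pull the reads that mapped to a single small chromosome back out as FASTQ. Here is a minimal sketch using pysam; the BAM path, chromosome name, and output file names are placeholders, not anything from the question.

```python
# Sketch: extract the reads mapped to one chromosome from an existing
# coordinate-sorted, indexed BAM and write them back out as FASTQ.
# Requires pysam; all paths and names below are hypothetical.
import pysam

bam_path = "old_run.sorted.bam"  # BAM from a previous run (assumption)
chrom = "chr21"                  # a small chromosome keeps the test quick

with pysam.AlignmentFile(bam_path, "rb") as bam, \
        open("chr21_R1.fastq", "w") as r1, \
        open("chr21_R2.fastq", "w") as r2:
    for read in bam.fetch(chrom):
        # One record per read: drop secondary/supplementary alignments
        if read.is_secondary or read.is_supplementary:
            continue
        seq = read.get_forward_sequence()  # original (pre-alignment) orientation
        if seq is None:
            continue
        qual = pysam.qualities_to_qualitystring(read.get_forward_qualities())
        out = r1 if read.is_read1 else r2
        out.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")
```

Note that pairs whose mates map to other chromosomes will end up unpaired, so depending on how strict your workflow is about pairing, you may need a re-pairing step afterwards.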
You could do that with an even smaller set (~3 GB) as long as you subsample randomly and use the same hardware resources for the test. Hopefully the failures are independent of dataset size; otherwise this test would not mean much.
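A minimal sketch of one way to do the random subsampling, assuming gzipped four-line FASTQ records with R1/R2 in matching order (the file names and the 3% fraction are placeholders). Re-seeding the same RNG for each file means both files make identical keep/drop decisions record by record, so pairs stay in sync; this is the same trick as running `seqtk sample` on each file with the same seed.

```python
# Sketch: down-sample a FASTQ pair to roughly a given fraction while
# keeping R1/R2 in sync. Assumes plain four-line gzipped FASTQ records
# in matching order across the two files.
import gzip
import random

FRACTION = 0.03  # ~3 GB out of ~100 GB
SEED = 42        # identical seed per file keeps pairs matched

def subsample(in_path: str, out_path: str) -> None:
    rng = random.Random(SEED)  # re-seeded per file so decisions match
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:  # end of file
                break
            if rng.random() < FRACTION:
                fout.writelines(record)

for r in ("R1", "R2"):  # hypothetical file names
    subsample(f"sample_{r}.fastq.gz", f"subsample_{r}.fastq.gz")
```

Because selection is per record, the output is only approximately 3% of the input, which is fine for a smoke test.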