This is easy to do by hand for a single genome, but I need to write code that splits a FASTA index (.fai) into sets of scaffolds based on size, keeping the scaffolds in their original order.
An example would be (only using the first 2 columns of an fai file):
If I want to split it into 2, my code should give me Scaffolds 1, 2 and 3 (total length 350) in the first file and Scaffold 4 (length 500) in the second file. If I want to split it into 3, my code should give me Scaffolds 1 and 2 (total 150), Scaffold 3 (length 200), and Scaffold 4 (length 500) as 3 separate files.
I need this for genomes with over 30,000 scaffolds, so that I can split a set of jobs and run them on multiple sets of VCF regions simultaneously. Is there a program that does this, or does anyone have some simple code or a suggestion for how to write it?
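To make the request concrete, here is a minimal Python sketch of one possible approach, not an existing tool. It assumes that "nearly equal" can be read as "minimize the sum of squared group lengths over contiguous groups", which happens to reproduce both splits in the example above. The function names `read_fai` and `split_into_groups` are made up for illustration, and the individual lengths of Scaffolds 1 and 2 (100 and 50) are placeholders that only respect the stated total of 150.

```python
def read_fai(path):
    """Return [(name, length), ...] from the first two columns of a .fai file."""
    records = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                records.append((fields[0], int(fields[1])))
    return records


def split_into_groups(records, k):
    """Split records into k contiguous, non-empty groups, keeping the order.

    Assumption: "balanced" means the sum of squared group lengths is minimal.
    Solved with a dynamic program using divide-and-conquer optimization.
    """
    n = len(records)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of scaffolds")
    prefix = [0]
    for _, length in records:
        prefix.append(prefix[-1] + length)
    inf = float("inf")
    prev = [0] + [inf] * n  # prev[i]: best cost for the first i scaffolds
    starts = []             # starts[j-1][i]: where the j-th (last) group begins
    for j in range(1, k + 1):
        cur = [inf] * (n + 1)
        arg = [0] * (n + 1)
        stack = [(j, n, j - 1, n - 1)]  # (lo, hi, first and last candidate start)
        while stack:
            lo, hi, p_lo, p_hi = stack.pop()
            if lo > hi:
                continue
            mid = (lo + hi) // 2
            best, best_p = inf, p_lo
            for p in range(p_lo, min(mid - 1, p_hi) + 1):
                cost = prev[p] + (prefix[mid] - prefix[p]) ** 2
                if cost < best:
                    best, best_p = cost, p
            cur[mid], arg[mid] = best, best_p
            stack.append((lo, mid - 1, p_lo, best_p))
            stack.append((mid + 1, hi, best_p, p_hi))
        starts.append(arg)
        prev = cur
    # Walk back from the end to recover the group boundaries.
    groups, end = [], n
    for j in range(k, 0, -1):
        begin = starts[j - 1][end]
        groups.append([name for name, _ in records[begin:end]])
        end = begin
    return groups[::-1]


# Placeholder lengths for Scaffold1 and Scaffold2 (they only need to sum to 150).
example = [("Scaffold1", 100), ("Scaffold2", 50), ("Scaffold3", 200), ("Scaffold4", 500)]
print(split_into_groups(example, 2))
# [['Scaffold1', 'Scaffold2', 'Scaffold3'], ['Scaffold4']]
print(split_into_groups(example, 3))
# [['Scaffold1', 'Scaffold2'], ['Scaffold3'], ['Scaffold4']]
```

For a real index, `split_into_groups(read_fai("genome.fa.fai"), 20)` would return 20 lists of scaffold names (the file name is a placeholder), and each list could then be written to its own file and used to select regions for one job. Because every scaffold lands in exactly one group, no scaffold is shared between files.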
Update=If a program can do this on the VCF directly, splitting it into sets of scaffolds whose total lengths are nearly equal, that would be even better! Note=It is also easy to split a VCF by lines, but I can't use that, because the same scaffold must not end up in more than one file. Each scaffold should exist in only one file after splitting (one file can of course contain multiple scaffolds).
Update2=It would also work to split into chunks of roughly 50 million bases each, for example: when the running sum of scaffold lengths reaches 50 million, the code outputs all those scaffolds as one set and starts adding up the next ones until they reach 50 million again.
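The threshold variant in Update2 is simpler, and a minimal Python sketch of it follows. The function name `chunk_by_total` and the toy lengths are made up for illustration (Scaffold1 and Scaffold2 again use placeholder lengths that sum to 150). It assumes a chunk is closed as soon as its running total reaches the limit, so a chunk can overshoot the limit by up to one scaffold, and the last chunk may be smaller.

```python
def chunk_by_total(records, limit):
    """Group (name, length) records in order; close a chunk once its total >= limit."""
    chunks, current, total = [], [], 0
    for name, length in records:
        current.append(name)
        total += length
        if total >= limit:
            chunks.append(current)
            current, total = [], 0
    if current:  # leftover scaffolds that never reached the limit
        chunks.append(current)
    return chunks


# Toy run with a limit of 300 instead of 50 million.
example = [("Scaffold1", 100), ("Scaffold2", 50), ("Scaffold3", 200), ("Scaffold4", 500)]
print(chunk_by_total(example, 300))
# [['Scaffold1', 'Scaffold2', 'Scaffold3'], ['Scaffold4']]
```

With real data the call would be `chunk_by_total(records, 50_000_000)`, where `records` holds the name and length columns of the .fai file. The number of output sets is then determined by the genome rather than chosen in advance.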