This is easy to do by hand for a single genome, but I need to write code that splits a FASTA index (.fai) into sets of scaffolds based on size, keeping the scaffolds in their original order.
An example would be (only using the first 2 columns of an fai file):
Scaffold1 100
Scaffold2 50
Scaffold3 200
Scaffold4 500
If I split it into 2, the code should put Scaffold1, Scaffold2 and Scaffold3 (total length 350) in the first file and Scaffold4 (length 500) in the second file. If I split it into 3, it should give Scaffold1 and Scaffold2 (total 150), Scaffold3 (length 200), and Scaffold4 (length 500) as 3 separate files.
I need this for genomes with over 30,000 scaffolds, so that I can split a set of jobs and run them on multiple sets of VCF regions simultaneously. Is there a program that does this, or does anyone have some simple code or a suggestion for how to write it?
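One way the "split into N" case could be sketched in Python is below. It is only an illustration under my own assumptions: the names `groups_needed` and `split_into_k` are made up, and I am reading "based on size" as "keep the order and make the largest group as small as possible". It binary-searches the smallest size cap that fits into at most k groups, then packs scaffolds left to right, forcing a new group when that is the only way to end up with exactly k groups.

```python
from typing import List, Tuple

Scaffold = Tuple[str, int]  # (name, length), i.e. the first two .fai columns


def groups_needed(lengths: List[int], cap: int) -> int:
    """Count the groups produced by packing left to right under `cap`.

    Assumes cap >= max(lengths), so a single scaffold always fits.
    """
    groups, total = 1, 0
    for length in lengths:
        if total + length > cap:
            groups += 1
            total = 0
        total += length
    return groups


def split_into_k(scaffolds: List[Scaffold], k: int) -> List[List[Scaffold]]:
    """Split scaffolds into exactly k ordered groups, minimising the largest."""
    if not 1 <= k <= len(scaffolds):
        raise ValueError("k must be between 1 and the number of scaffolds")
    lengths = [length for _, length in scaffolds]

    # Binary search for the smallest cap that needs at most k groups.
    lo, hi = max(lengths), sum(lengths)
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(lengths, mid) <= k:
            hi = mid
        else:
            lo = mid + 1
    cap = lo

    groups: List[List[Scaffold]] = [[]]
    total = 0
    for i, (name, length) in enumerate(scaffolds):
        remaining = len(scaffolds) - i   # scaffolds not yet placed
        still_to_open = k - len(groups)  # groups that must still be created
        over_cap = total + length > cap
        must_split = remaining <= still_to_open
        if groups[-1] and (over_cap or must_split):
            groups.append([])
            total = 0
        groups[-1].append((name, length))
        total += length
    return groups
```

With the four scaffolds above, `split_into_k(scaffolds, 2)` returns Scaffold1-3 and Scaffold4, and `split_into_k(scaffolds, 3)` returns Scaffold1-2, Scaffold3 and Scaffold4. Each binary search step is a single pass over the list, so 30,000 scaffolds should not be a problem. Note that it only minimises the largest group; the smaller groups are not necessarily as even as they could be.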
Update: If any program can do this on the VCF directly, splitting it into sets of scaffolds whose total lengths are nearly equal, that would be even better! Note: It is also easy to split a VCF by lines, but I can't use that, because the same scaffold must not end up in more than 1 file. Each scaffold should exist in only 1 file after splitting (although 1 file can of course contain multiple scaffolds).
Update2: It would also work to split into chunks of roughly 50 million bases, for example: once the running sum of scaffold lengths reaches 50 million, the code outputs all those scaffolds and starts adding up the next ones until they reach 50 million again.
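The running-total idea in Update2 could look roughly like this in Python. This is a minimal sketch, not a tested tool: `read_fai` and `chunk_by_target` are hypothetical names, and it assumes the first two whitespace-separated columns of each .fai line are the scaffold name and its length.

```python
from typing import Iterable, Iterator, List, Tuple

Scaffold = Tuple[str, int]  # (name, length)


def read_fai(lines: Iterable[str]) -> Iterator[Scaffold]:
    """Yield (name, length) from the first two columns of .fai lines."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            yield fields[0], int(fields[1])


def chunk_by_target(scaffolds: Iterable[Scaffold],
                    target: int) -> Iterator[List[Scaffold]]:
    """Group consecutive scaffolds, closing a chunk once it reaches `target`."""
    chunk: List[Scaffold] = []
    total = 0
    for name, length in scaffolds:
        chunk.append((name, length))
        total += length
        if total >= target:
            yield chunk
            chunk, total = [], 0
    if chunk:  # leftover scaffolds that never reached the target
        yield chunk


if __name__ == "__main__":
    example = ["Scaffold1 100", "Scaffold2 50", "Scaffold3 200", "Scaffold4 500"]
    for i, chunk in enumerate(chunk_by_target(read_fai(example), 350), start=1):
        names = ",".join(name for name, _ in chunk)
        print(f"chunk{i}\t{names}\t{sum(length for _, length in chunk)}")
```

A scaffold is never split, so a chunk can overshoot the target by up to one scaffold length, and the last chunk may fall short of it. For a real genome the target would be something like `50_000_000` and the lines would come from the open .fai file.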
Thanks
The index file should be relatively small. You could read it from a central location. So are you sure you will need to do this?
Thanks for your comment. Yes, the fai itself is small, but the jobs I want to run on the whole VCF take a very long time, and I need this to be able to split the VCF into nearly equal chunks. I used to do this by splitting the fasta index manually, by eye, but I now need a program or script to do that step for the new genomes to come.
Is this what you want: Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps?
Thanks, that is very similar actually, but I can't split a chromosome into chunks: I need each scaffold to exist in only 1 file, and I am only interested in the total size of the scaffold. I am trying to do it on an fai file using awk now, and will post my code here if I manage.
You don't have to split the chromosomes. You can treat each whole chromosome as a single BED record.
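To illustrate the whole-scaffold BED idea, here is a small Python sketch. The function name `fai_to_bed` is made up for this example, and it assumes the first two whitespace-separated columns of each .fai line are the name and the length. BED coordinates are 0-based and half-open, so a whole scaffold is simply `name 0 length`.

```python
from typing import Iterable, List


def fai_to_bed(fai_lines: Iterable[str]) -> List[str]:
    """Turn each .fai line into one BED record covering the whole scaffold."""
    bed = []
    for line in fai_lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        name, length = fields[0], int(fields[1])
        # BED is 0-based, half-open: start 0, end = scaffold length.
        bed.append(f"{name}\t0\t{length}")
    return bed


if __name__ == "__main__":
    example = ["Scaffold1 100", "Scaffold2 50", "Scaffold3 200", "Scaffold4 500"]
    print("\n".join(fai_to_bed(example)))
```

Records like these can then be grouped by size without ever cutting a scaffold, and each group written out as a regions file for whatever tool runs the per-chunk job.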