Question: Split fai (fasta index) or VCF into N nearly equal total length chunks
8 days ago by
FatihSarigol120 wrote:

It is easy to do this manually for your favorite genome, but I need to write code that splits a fasta index into sets of consecutive scaffolds, kept in their original order, so that the sets have nearly equal total length.

An example would be (only using the first 2 columns of an fai file):

Scaffold1 100
Scaffold2 50
Scaffold3 200
Scaffold4 500

If I want to split it into 2, my code should give me Scaffold1, Scaffold2, and Scaffold3 (total length 350) in the first file and Scaffold4 (length 500) in the second file. If I want to split it into 3, my code should give me Scaffold1 and Scaffold2 (total 150), Scaffold3 (length 200), and Scaffold4 (length 500) as 3 separate files.
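One way to get the groupings described above is a greedy pass over the scaffolds in order. The sketch below is only an illustration, not the code from the answer further down: the function name `split_into_n` and the rule "start a new chunk when adding the next scaffold would exceed total/N" are assumptions that happen to reproduce the example. On other inputs it can return fewer than N chunks, and the last chunk absorbs whatever is left.

```python
def split_into_n(records, n_chunks):
    """Greedily split (name, length) records into at most n_chunks ordered groups.

    Assumed rule: a new chunk is started when adding the next scaffold would
    push the current chunk past total_length / n_chunks. Scaffolds are never
    split and their order is preserved.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    total = sum(length for _, length in records)
    target = total / n_chunks
    chunks = [[]]
    running = 0
    for name, length in records:
        would_overflow = running + length > target
        # Only open a new chunk if the current one is non-empty and we
        # still have chunks left to open.
        if chunks[-1] and would_overflow and len(chunks) < n_chunks:
            chunks.append([])
            running = 0
        chunks[-1].append((name, length))
        running += length
    return chunks


# The four scaffolds from the example above.
EXAMPLE = [("Scaffold1", 100), ("Scaffold2", 50),
           ("Scaffold3", 200), ("Scaffold4", 500)]

if __name__ == "__main__":
    for n in (2, 3):
        print(f"split into {n}:")
        for chunk in split_into_n(EXAMPLE, n):
            names = ",".join(name for name, _ in chunk)
            print(f"  {names} (total {sum(length for _, length in chunk)})")
```

With the example, N=2 gives totals of 350 and 500, and N=3 gives 150, 200, and 500. Because the pass is greedy, it does not guarantee the most balanced split for arbitrary inputs.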

I need this for genomes with over 30,000 scaffolds, so that I can split a set of jobs and run them on multiple sets of VCF regions simultaneously. Is there a program that does this, or does anyone have simple code or a suggestion for how to write it?

Update: If any program can do this on the VCF directly, splitting it into sets of scaffolds whose total lengths are nearly equal, that would be even better! Note: It is also easy to split a VCF by lines, but I can't use that, because the same scaffold must not end up in more than one file. Each scaffold should exist in exactly one file after splitting (a single file can of course contain multiple scaffolds).

Update 2: It would also work to split into chunks of roughly 50 million bases each, for example: once the summed lengths of consecutive scaffolds reach 50 million, the code outputs all those scaffolds and starts summing the next ones toward 50 million again.
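The threshold variant in Update 2 can be sketched as a single pass that closes a chunk as soon as its running total reaches the limit. The name `split_by_threshold` is hypothetical, and the small thresholds in the usage lines are just for the toy example; 50,000,000 would be the value mentioned above.

```python
def split_by_threshold(records, threshold):
    """Group consecutive (name, length) records until each group's total
    length reaches `threshold`; the final group may fall short of it."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    chunks, current, running = [], [], 0
    for name, length in records:
        current.append((name, length))
        running += length
        if running >= threshold:
            # Threshold reached: emit this group and start a new one.
            chunks.append(current)
            current, running = [], 0
    if current:
        # Leftover scaffolds that never reached the threshold.
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    example = [("Scaffold1", 100), ("Scaffold2", 50),
               ("Scaffold3", 200), ("Scaffold4", 500)]
    for chunk in split_by_threshold(example, 300):
        print([name for name, _ in chunk])
```

Note that a chunk can overshoot the threshold by up to the length of its last scaffold, so a single very long scaffold always forms (or ends) its own chunk.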


faidx genome • 110 views
ADD COMMENTlink modified 1 hour ago • written 8 days ago by FatihSarigol120

The index file should be relatively small. You could read it from a central location. So are you sure you will need to do this?

ADD REPLYlink written 8 days ago by genomax64k

Thanks for your comment. Yes, the fai itself is small, but the jobs I want to run on the whole VCF take a very long time, and I need this to be able to split the VCF into nearly equal chunks. I used to do this by manually splitting the fasta index by eye, but now I need a program or script to do that step for new genomes to come.

ADD REPLYlink written 8 days ago by FatihSarigol120

Is this what you want: Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps?

ADD REPLYlink written 8 days ago by Pierre Lindenbaum118k

Thanks, that is very similar actually, but I can't split a chromosome into chunks: I need each scaffold to exist in only 1 file, and I am only interested in the total size of each scaffold. I am trying to do it on an fai file using awk now; I will post my code here if I manage.

ADD REPLYlink written 8 days ago by FatihSarigol120

You don't have to split the chromosomes. You can use each whole chromosome as a single BED record.

ADD REPLYlink written 7 days ago by Pierre Lindenbaum118k
1 hour ago by
FatihSarigol120 wrote:

I wrote code that does exactly what I want. If anyone else needs something similar, you can find it here

Run it as

./ samtoolsExecutable fastaFile numberOfDivisions

to divide your fasta index into subsets of nearly equal total length, based on how many subsets you want, keeping each scaffold in only 1 subset. You can use the output subset index files to extract these regions (for example, to run analyses on them that normally take a long time), as they are in BED format, with each record running from 0 to the end of the scaffold.
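For readers who want to see what such BED output looks like, here is a minimal sketch, not the linked script itself. It assumes the standard fai layout, where the first two whitespace-separated columns are the sequence name and its length; `read_fai` and `chunk_to_bed` are hypothetical helper names.

```python
import io


def read_fai(handle):
    """Read (name, length) pairs from the first two columns of an fai file."""
    records = []
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        records.append((fields[0], int(fields[1])))
    return records


def chunk_to_bed(chunk):
    """Render one subset of scaffolds as BED text: name, 0, scaffold length."""
    return "".join(f"{name}\t0\t{length}\n" for name, length in chunk)


if __name__ == "__main__":
    # Toy two-scaffold index; the last three fai columns are made-up values.
    fai_text = "Scaffold1\t100\t11\t60\t61\nScaffold2\t50\t124\t60\t61\n"
    records = read_fai(io.StringIO(fai_text))
    print(chunk_to_bed(records), end="")
```

Since BED intervals are 0-based and half-open, a record from 0 to the scaffold length covers the entire scaffold.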

ADD COMMENTlink written 1 hour ago by FatihSarigol120