Question: Split fai (fasta index) or VCF into N nearly equal total length chunks
5 months ago, FatihSarigol (Durham) wrote:

It is easy to do this manually for your favorite genome, but I need to write code that splits a fasta index into sets of scaffolds, keeping the original order, so that the sets have nearly equal total size.

An example would be (only using the first 2 columns of an fai file):

Scaffold1 100
Scaffold2 50
Scaffold3 200
Scaffold4 500

If I want to split it into 2, my code should put Scaffolds 1, 2 and 3 (total length 350) in the first file and Scaffold 4 (length 500) in the second file. If I want to split it into 3, it should put Scaffolds 1 and 2 (total 150), Scaffold 3 (length 200), and Scaffold 4 (length 500) in 3 separate files.
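One way to get exactly these groupings is to choose the cut points that minimise the sum of squared chunk totals, which for a fixed overall length is the same as minimising the variance of the chunk totals. The following is only a minimal sketch under that assumption; the function name `split_contiguous` is made up for illustration, and the plain dynamic programme costs on the order of N x (number of scaffolds)^2 steps, so it may be too slow as written for 30,000 scaffolds.

```python
from itertools import accumulate


def split_contiguous(lengths, n_chunks):
    """Split lengths into n_chunks contiguous groups, in order, minimising
    the sum of squared group totals. Returns (start, end) index pairs."""
    n = len(lengths)
    if not 1 <= n_chunks <= n:
        raise ValueError("n_chunks must be between 1 and the number of scaffolds")
    prefix = [0] + list(accumulate(lengths))  # prefix[i] = total of the first i lengths
    inf = float("inf")
    # cost[k][i]: best cost of splitting the first i scaffolds into k groups
    cost = [[inf] * (n + 1) for _ in range(n_chunks + 1)]
    cut = [[0] * (n + 1) for _ in range(n_chunks + 1)]
    cost[0][0] = 0
    for k in range(1, n_chunks + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last group is scaffolds j..i-1
                c = cost[k - 1][j] + (prefix[i] - prefix[j]) ** 2
                if c < cost[k][i]:
                    cost[k][i] = c
                    cut[k][i] = j
    # Walk back through the recorded cut points to recover the groups
    bounds = []
    i = n
    for k in range(n_chunks, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    return bounds[::-1]


names = ["Scaffold1", "Scaffold2", "Scaffold3", "Scaffold4"]
lengths = [100, 50, 200, 500]
for n_chunks in (2, 3):
    groups = [names[a:b] for a, b in split_contiguous(lengths, n_chunks)]
    print(n_chunks, groups)
# 2 [['Scaffold1', 'Scaffold2', 'Scaffold3'], ['Scaffold4']]
# 3 [['Scaffold1', 'Scaffold2'], ['Scaffold3'], ['Scaffold4']]
```

Each returned pair is a half-open range of scaffold indices, so no scaffold can land in more than one group.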

I need this for genomes with over 30,000 scaffolds, so that I can split a set of jobs and run them on multiple sets of VCF regions simultaneously. Is there a program that does this, or does anyone have simple code or a suggestion for how to write it?

Update: If any program can do this on the VCF directly, splitting it into sets of scaffolds whose total lengths are nearly equal, that would be even better! Note: It is also easy to split a VCF by lines, but I can't use that, because the same scaffold must not end up in more than 1 file. Each scaffold should exist in only 1 file after splitting (though 1 file can of course contain multiple scaffolds).
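Once each scaffold has been assigned to a chunk, splitting the VCF itself can be sketched as a single streaming pass that routes every record by its first column. This is a rough illustration only, assuming an uncompressed, plain-text VCF and a hypothetical `chunk_of` dictionary (scaffold name to chunk index) built beforehand; the function name and output naming scheme are made up here.

```python
def split_vcf(vcf_path, chunk_of, out_prefix):
    """Stream a plain-text VCF and write each record to the output file of
    the chunk its scaffold belongs to. Header lines go to every output.

    chunk_of: dict mapping scaffold name -> chunk index (0-based), assumed
    to cover every scaffold in the VCF (a missing name raises KeyError).
    """
    n_chunks = max(chunk_of.values()) + 1
    paths = [f"{out_prefix}.{i}.vcf" for i in range(n_chunks)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(vcf_path) as vcf:
            for line in vcf:
                if line.startswith("#"):
                    # Copy meta-information and column header lines to all chunks
                    for out in outs:
                        out.write(line)
                else:
                    # CHROM is the first tab-separated column of a record
                    chrom = line.split("\t", 1)[0]
                    outs[chunk_of[chrom]].write(line)
    finally:
        for out in outs:
            out.close()
    return paths
```

Because routing depends only on the scaffold name, all records of one scaffold end up in the same file, and the VCF is never held in memory.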

Update 2: It would also work to split into chunks of roughly 50 million bases each, for example: when the summed lengths of consecutive scaffolds reach 50 million, the code outputs all those scaffolds and starts adding up the next ones until they reach 50 million again.
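The running-total idea in this update can be sketched in a few lines. This is a minimal illustration, not a tested tool; the function name `chunk_by_target` is made up, and the input is assumed to be (name, length) pairs taken from the first two columns of the fai file.

```python
def chunk_by_target(scaffolds, target):
    """Group (name, length) pairs in order, closing a chunk as soon as its
    running total reaches the target number of bases."""
    chunks, current, total = [], [], 0
    for name, length in scaffolds:
        current.append(name)
        total += length
        if total >= target:
            chunks.append(current)
            current, total = [], 0
    if current:  # leftover scaffolds that never reached the target
        chunks.append(current)
    return chunks


example = [("Scaffold1", 100), ("Scaffold2", 50), ("Scaffold3", 200), ("Scaffold4", 500)]
print(chunk_by_target(example, 300))
# [['Scaffold1', 'Scaffold2', 'Scaffold3'], ['Scaffold4']]
```

With this approach the number of chunks is not fixed in advance, a chunk can overshoot the target by up to the length of its last scaffold, and the final chunk may be smaller than the rest.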

Thanks

Tags: faidx, genome

The index file should be relatively small. You could read it from a central location. So are you sure you will need to do this?

written 5 months ago by genomax

Thanks for your comment. Yes, the fai itself is small, but the jobs I want to run on the whole VCF take a very long time, so I need this to split the VCF into nearly equal chunks. I used to do this by splitting the fasta index manually, by eye, but I now need a program or script to do that step for new genomes to come.

written 5 months ago by FatihSarigol

Is this what you want: Programming Challenge: Divide The Human Genome Among X Cores, Taking Into Account Gaps?

written 5 months ago by Pierre Lindenbaum

Thanks, that is very similar, but I can't split a chromosome into chunks: I need each scaffold to exist in only 1 file, and I am only interested in the total size of each scaffold. I am trying to do it on a fai file using awk now; I will post my code here if I manage.

written 5 months ago by FatihSarigol

You don't have to split the chromosomes. You can use each whole chromosome as a single BED record.

written 5 months ago by Pierre Lindenbaum
5 months ago, FatihSarigol (Durham) wrote:

I wrote a script that does exactly what I want. If anyone else needs something similar, you can find it here

Run it as

./FASTAindexSPLITTERinEQUALsize.sh samtoolsExecutable fastaFile numberOfDivisions

to divide your fasta index into subsets of nearly equal total length, based on how many subsets you want, keeping each scaffold in only 1 subset. You can use the output subset index files to extract these regions (for example, to run analyses on them that would normally take a long time), as they are in BED format, with each scaffold running from 0 to its end.
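For readers who want to see what such whole-scaffold BED records look like, here is a small Python sketch. It is not the linked script, only an illustration; the function name `fai_to_bed_lines` is made up, and it assumes standard fai lines where the first two whitespace-separated columns are the sequence name and its length.

```python
def fai_to_bed_lines(fai_lines):
    """Turn .fai lines into whole-scaffold BED lines: name, 0, length.

    BED intervals are 0-based and half-open, so (0, length) covers the
    entire scaffold.
    """
    bed = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split()[:2]
        bed.append(f"{name}\t0\t{int(length)}")
    return bed


fai = ["Scaffold1\t100\t11\t60\t61", "Scaffold2\t50\t124\t60\t61"]
for record in fai_to_bed_lines(fai):
    print(record)
```

Writing one such list per subset gives one BED file per job, and each scaffold appears in exactly one of them.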
