Question: Split fasta file into chromosomes
I have 105 bacterial isolates which are assembled into contigs. I also have a really good, fully assembled, reference genome, consisting of 2 chromosomes.

I would like to align each of my isolates to the reference genome, determine which contigs/sequences belong to which chromosome, and then split every isolate's FASTA file into two chromosome files.

I have been told that Mauve is good for this. However, I have 105 isolates, and Mauve doesn't seem to be able to cope with this much data at once. I could align smaller groups to the reference genome at a time, but is there another way or tool to do this?


Do you only want to split the reference into two chromosome files so you can use mauve on them separately?

No, I know how to split my reference into two chromosome files... I need to split all of my 105 isolates into two chromosome files.

Based on what criteria? What format are the files currently in?

Fasta format.

I want to either split the contigs between the two chromosomes or just crudely split the FASTA sequence, based on how they align to the reference genome.

Have you tried running Mauve against the two chromosomes independently? Assuming there is no significant homology between the two chromosomes, that would allow you to identify which contigs from each isolate belong to each chromosome. You can then extract those contigs with a program called faSomeRecords from the Kent utilities. That may be a lot of Mauve runs, but it can work.
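If you would rather not install the Kent utilities, the extraction step can be sketched in plain Python. This is only a minimal illustration, not how faSomeRecords works internally: the function names (read_fasta, split_by_chromosome, write_fasta) and the contig-to-chromosome dictionary are made up for the example, and it assumes the contig ID is the first word of each FASTA header.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_chromosome(handle, contig_to_chrom):
    """Group FASTA records by the chromosome assigned to each contig.

    contig_to_chrom is a hypothetical dict mapping a contig ID (assumed
    to be the first word of the header) to a chromosome label. Contigs
    missing from the dict are collected under "unassigned".
    """
    groups = {}
    for header, seq in read_fasta(handle):
        contig_id = (header.split() or [""])[0]
        chrom = contig_to_chrom.get(contig_id, "unassigned")
        groups.setdefault(chrom, []).append((header, seq))
    return groups


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

For each isolate you would open its FASTA file, call split_by_chromosome with the assignments you got from the alignment, and write each group to its own file. Keeping an "unassigned" group makes it easy to spot contigs that did not align anywhere, for example plasmid sequence.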

Another option is to try lastz, which was designed for aligning chromosome-sized sequences, to identify the contigs you need.
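Whichever aligner you use, the assignment step afterwards is roughly the same: add up how much of each contig aligns to each chromosome and take the winner. Here is a rough Python sketch of that idea. It assumes you have already reduced the aligner output to a tab-separated table with three columns (contig ID, reference chromosome name, aligned length); that layout, the function name assign_contigs and the 0.8 cutoff are all assumptions for illustration, not something lastz produces by default.

```python
from collections import defaultdict


def assign_contigs(hit_lines, min_fraction=0.8):
    """Assign each contig to the chromosome holding most of its aligned bases.

    hit_lines: iterable of tab-separated lines with (assumed) columns
    contig ID, chromosome name, aligned length. Lines that are blank or
    start with '#' are skipped.
    min_fraction: a contig is labelled "ambiguous" if its best chromosome
    holds less than this share of its total aligned bases.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        contig, chrom, aligned = line.split("\t")[:3]
        # Overlapping hits are simply summed here, so repeats can be
        # counted more than once; good enough for a crude split.
        totals[contig][chrom] += int(aligned)

    assignment = {}
    for contig, per_chrom in totals.items():
        total = sum(per_chrom.values())
        best_chrom, best = max(per_chrom.items(), key=lambda kv: kv[1])
        if total > 0 and best / total >= min_fraction:
            assignment[contig] = best_chrom
        else:
            assignment[contig] = "ambiguous"
    return assignment
```

Contigs flagged as "ambiguous" are the ones worth checking by hand, since they may sit in regions shared by both chromosomes or may be misassemblies.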

Sorry, I'm new to this, but do you know how to use the file that Mauve gives, from 5 to 3, to work out which contigs belong to chromosome 1 and which to chromosome 2?

Maybe Satsuma is a good option for your task. Satsuma is a tool that reliably aligns large and complex DNA sequences providing maximum sensitivity, specificity and speed. I've used it to align contigs against well assembled related genomes and sort the contigs according to their hit location on the reference.
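The sorting step I mean can be sketched like this in Python. It is not tied to Satsuma's output format: it assumes you have already parsed the alignments into (contig, chromosome, reference start, aligned length) tuples, and the function name order_contigs is made up for the example. Each contig is placed by its longest hit.

```python
def order_contigs(hits):
    """Return contig IDs ordered by chromosome, then by reference position.

    hits: iterable of (contig, chrom, ref_start, aligned_len) tuples
    (an assumed, already-parsed representation of the alignments).
    Each contig is positioned using its single longest hit.
    """
    best = {}
    for contig, chrom, ref_start, aligned_len in hits:
        if contig not in best or aligned_len > best[contig][2]:
            best[contig] = (chrom, ref_start, aligned_len)
    return sorted(best, key=lambda c: (best[c][0], best[c][1]))
```

Using only the longest hit is a simplification; a contig that spans a rearrangement will still be placed at one location.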

