Hi all -- sorry if this is long...
I'm trying to look at how the genome fragmentation that comes with incomplete assembly of isolate genomes affects the noncoding DNA % predicted for the genome by Prodigal.
To do so, I used the pyfasta split function to split completed genomes from JGI into pieces, and I got a really weird result...
Acetoanaerobium_sticklandii_DSM_519 0.8802034536429185
Acetobacterium_woodii_WB1_DSM_1030 0.8465615185692128
Acetohalobium_arabaticum_Z-7288_DSM_5501 0.8517360940187091

Acetoanaerobium_sticklandii_DSM_519.split.100Kmer 0.8977238119052345
Acetobacterium_woodii_WB1_DSM_1030.split.100Kmer 0.8634179337946196
Acetohalobium_arabaticum_Z-7288_DSM_5501.split.100Kmer 0.8686845945652649

Acetoanaerobium_sticklandii_DSM_519.split.1000Kmer 0.8978000420554741
Acetobacterium_woodii_WB1_DSM_1030.split.1000Kmer 0.863457243749161
Acetohalobium_arabaticum_Z-7288_DSM_5501.split.1000Kmer 0.8687550514335138

Acetoanaerobium_sticklandii_DSM_519.split.2000Kmer 0.8978000420554741
Acetobacterium_woodii_WB1_DSM_1030.split.2000Kmer 0.8634958120064469
Acetohalobium_arabaticum_Z-7288_DSM_5501.split.2000Kmer 0.8687562662071043
The numbers are the coding % (written as fractions of the genome). I calculate this by taking the mRNA output file from Prodigal, counting the bases, multiplying by 3, and dividing by the genome size. What I noticed is that the noncoding DNA % doesn't change much as you decrease contig size, within reason, but as soon as you split the genome at all, the coding % goes up, as you can see above!
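For anyone who wants to reproduce the calculation, here is a minimal Python sketch of it, not the exact script used above. It assumes the gene file is a nucleotide FASTA (Prodigal's `-d` option writes one), in which case no multiplication by 3 is needed; if the file holds protein translations (the `-a` option), the hypothetical `bases_per_symbol` parameter would be set to 3. The function names are made up for illustration. Lengths are counted from the sequence lines only, so headers and line breaks cannot leak into either the numerator or the genome size.

```python
def total_sequence_length(lines):
    """Sum the sequence lengths in a FASTA file, ignoring headers and line breaks."""
    total = 0
    for line in lines:
        line = line.strip()  # drops the trailing newline and stray whitespace
        if line and not line.startswith(">"):
            total += len(line)
    return total


def coding_fraction(gene_lines, genome_lines, bases_per_symbol=1):
    """Coding fraction = total gene length (in bases) / total genome length.

    bases_per_symbol is an assumption of this sketch: 1 for a nucleotide
    gene file, 3 if the gene file contains amino acid sequences.
    """
    coding = total_sequence_length(gene_lines) * bases_per_symbol
    genome = total_sequence_length(genome_lines)
    return coding / genome


# Toy data: the same 15 bp "genome", once wrapped and once on a single line.
genes = [">gene_1", "ATGAAATAG"]
genome_wrapped = [">contig_1", "ATGAAATAGC", "CCGGG"]
genome_linear = [">contig_1", "ATGAAATAGCCCGGG"]

print(coding_fraction(genes, genome_wrapped))
print(coding_fraction(genes, genome_linear))
```

With real data, the two arguments would be open file handles for the gene FASTA and the genome FASTA. Counted this way, the wrapped and single-line versions of the same genome give the same genome size, which makes it easier to tell whether a difference comes from Prodigal's predictions or from the counting step.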
Even in the case of Acetoanaerobium_sticklandii, where the 2 Mbp split just cuts the genome into TWO contigs, the coding % sees the same jump of roughly 1.7 percentage points...so I went into the FASTAs to take a look, and the ONLY difference between the two is that when the sequence is split, the FASTA file gets linearized (each sequence ends up on a single line).
I tried linearizing the whole genome with a script, as well as using other splitting software that also linearizes, and it seems like the discrepancies are coming from linearizing the DNA.
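To be concrete about what "linearizing" means here, this is a minimal Python sketch of the kind of script that does it. It is an illustration only, not the script or the splitting software actually used, and `linearize_fasta` is a made-up name. It keeps each header line as is and joins that record's sequence lines into one line, so the bases themselves are unchanged.

```python
def linearize_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto a single line."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                # Emit the previous record before starting a new one.
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the final record.
        yield header
        yield "".join(chunks)


# Toy example: two records, the first wrapped over two lines.
wrapped = [">contig_1 example", "ACGT", "ACGT", ">contig_2", "GG"]
linear = list(linearize_fasta(wrapped))
print("\n".join(linear))
```

Running it on an already linearized file returns the same lines, so the only thing that changes between the two versions of a genome is where the line breaks fall.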
So, long story short, I'm getting different Prodigal outputs from FASTAs that were simply reformatted by removing line breaks. Has anyone else noticed this issue, and does anyone know what the source of the problem is?
Thanks!
Jon