Hi everyone, I have whole-genome sequence data for 50 E. coli isolates. I did de novo assembly with SPAdes, so I now have 50 contig files and 50 scaffold files. I want to extract the core genome and the accessory genome separately from all 50 genomes. Can anyone suggest how I should proceed to the next step?
Yet another option: pass the assemblies into Cortex and dump unitigs (= supernodes in Cortex jargon), then pass them back into Cortex using Cortex's pan_genome_matrix option. It will give you a big matrix showing which unitigs are in which samples. Then you can make your own choice about what percentage of samples a contig (unitig) needs to be in to be considered "core": 90%? 95%? 100%?
Roughly speaking, the command lines are
run_calls.pl --fastaq_index INDEX_SPECIFYING_SAMPLE_ID_AND_ASSEMBLY --kmer_size 21 --mem_height 21 --mem_width 100 --do_union no --auto_clean no --outdir DIR
This will make Cortex graph files of all the assemblies
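If it helps, here is a minimal sketch of building the index file passed to --fastaq_index. It assumes (please check this against the Cortex manual) that the index is tab-separated with one line per sample: the sample ID, a file listing that sample's single-end FASTA/FASTQ files, and two paired-end list columns set to "." when unused. The function name make_index, the .se_list suffix and the <sample>.fasta naming are made up for illustration.

```shell
# Sketch only: write one index line per assembly found in a directory.
# ASSUMPTION: index columns are sample_id <TAB> se_list <TAB> pe1_list <TAB> pe2_list,
#             with "." meaning "no file"; verify against the Cortex manual.
# ASSUMPTION: assemblies are named <sample>.fasta (hypothetical layout).
make_index() {
    local asm_dir out fa sample
    asm_dir=$(cd "$1" && pwd) || return 1   # absolute path to the assemblies
    out=$2
    : > "$out"                              # start with an empty index file
    for fa in "$asm_dir"/*.fasta; do
        [ -e "$fa" ] || continue            # skip if the glob matched nothing
        sample=$(basename "$fa" .fasta)
        # One-line file list pointing at this sample's assembly
        printf '%s\n' "$fa" > "$asm_dir/$sample.se_list"
        printf '%s\t%s\t.\t.\n' "$sample" "$asm_dir/$sample.se_list" >> "$out"
    done
}
```

Usage would be something like: make_index path/to/assemblies INDEX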
Then, dump unitigs
ls DIR/binaries/unclean/31/*.ctx > list_of_binaries
ls list_of_binaries > pool
cortex_var_31_c1 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list pool --output_supernodes unitigs.txt
This dumps the unitigs as a FASTA file called unitigs.txt.
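The matrix step needs the maximum contig length in unitigs.txt for --max_read_len. One way to get it is a small awk helper like the sketch below; the function name max_contig_len is just for illustration, and it only assumes unitigs.txt is ordinary FASTA (header lines starting with ">").

```shell
# Print the length in bases of the longest sequence in a FASTA file.
# Handles sequences wrapped over several lines.
max_contig_len() {
    awk '
        { sub(/\r$/, "") }                                        # tolerate Windows line endings
        /^>/ { if (len > max) max = len; len = 0; next }          # new record: close the previous one
        { len += length($0) }                                     # sequence line: add its bases
        END { if (len > max) max = len; print max + 0 }           # close the last record
    ' "$1"
}
```

Run it as: max_contig_len unitigs.txt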
And finally, dump the matrix. Rows are contigs: the first column is the contig ID, the second column is the % of 21-mers in the contig that are in sample 1, the next column is the % of 21-mers in the contig that are in sample 2, and so on.
for f in $(ls *.ctx); do echo $(pwd)/$f > $f.filelist; done
ls DIR/binaries/unclean/31/*filelist > colourlist_of_samples
cortex_var_31_c100 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list colourlist_of_samples --pan_genome_matrix unitigs.txt --max_read_len <max contig length in unitigs.txt>
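Once you have the matrix, picking "core" versus "accessory" is just a threshold on each row. Here is a rough sketch with awk. It assumes the matrix is whitespace-separated with no header row, that the values are percentages on a 0 to 100 scale, and that a unitig counts as present in a sample when at least 90% of its k-mers are there. All of those are my assumptions, so check them against your actual output. The function name classify_unitigs and both cutoffs are illustrative only.

```shell
# Label each unitig as core or accessory from the pan-genome matrix.
#   $1 = matrix file (col 1 = unitig ID, cols 2.. = % of k-mers per sample)
#   $2 = min % of k-mers for a unitig to count as present in a sample (default 90)
#   $3 = min fraction of samples a unitig must be present in to be core (default 0.95)
# ASSUMPTION: no header row, whitespace-separated, values on a 0-100 scale.
classify_unitigs() {
    local matrix=$1 min_pct=${2:-90} core_fraction=${3:-0.95}
    awk -v min_pct="$min_pct" -v core="$core_fraction" '
        NF < 2 { next }                              # skip blank or malformed rows
        {
            n = NF - 1; hits = 0
            for (i = 2; i <= NF; i++)
                if ($i + 0 >= min_pct) hits++        # sample contains this unitig
            label = (hits / n >= core) ? "core" : "accessory"
            print $1 "\t" hits "\t" n "\t" label     # ID, samples present, samples total, label
        }
    ' "$matrix"
}
```

For example, classify_unitigs matrix.txt 90 1.0 would call a unitig core only if it is present in every sample, where matrix.txt stands for wherever you saved the matrix. The IDs labelled core can then be used to pull the matching sequences out of unitigs.txt.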