Question: How To Extract Core Genome From Whole Genome Sequence Data?
gravatar for HG
6.0 years ago by
HG1.1k wrote:

Hi everyone, I have 50 E.coli whole genome sequence data. I did denovo assembly with spades. As a result i have 50 contig file and 50 scaffold file. I want to extract core genome and accessory genome in separately from all 50 genome. Can any one suggest me how i can do proceed to next step?

ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 6.0 years ago by HG1.1k

is the "core genome" term means common genome? I would annotate genomes using the reference one, and then make a table of genes and their presence in each of 50 and their similarity to the reference - this table will help to extract any subset configured by minimal support and score thresholds

ADD REPLYlink written 6.0 years ago by Pavel Senin1.9k

Yes core genome means common region among all species. I am thinking like this way after assembly i can map all the contig with contiguator to a reference genome and i can extract map contig and unmap contig. Next step if i run a Cd hit with all those contig i can get the common contig among all 50 genome. Whats your view regarding my opinion ??

ADD REPLYlink written 6.0 years ago by HG1.1k

sounds good to me. what can possibly happen - due to the contig level of granularity - is that you will end up with no room to change parameters except using the CD-HIT similarity threshold.

ADD REPLYlink written 6.0 years ago by Pavel Senin1.9k

Yes i appreciate your opinion. If i go after annotation if will also be fine but it will take little bit more time doing in RAST or GenDB. I will try both way parallel. For Cd hit what will be ideal cut off could you please give me any idea??

ADD REPLYlink written 6.0 years ago by HG1.1k

I don't know, this is the part of the problem I was mentioning, while running cd-hit you may face the the situation when some contigs are not OK to be clustered together and some you'd rather keep together, but cd-hit will set them apart, because of lengths etc. but you may not face that - it depends on the assemblies, also it is my opinion, I might be wrong.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Pavel Senin1.9k
gravatar for 5heikki
6.0 years ago by
5heikki8.6k wrote:

Another option would be all vs all blast, some filtering, and then mclblastline..

ADD COMMENTlink written 6.0 years ago by 5heikki8.6k
gravatar for zam.iqbal.genome
6.0 years ago by
United Kingdom
zam.iqbal.genome1.7k wrote:

Yet another option: Pass the assemblies into Cortex, and dump unitigs (=supernodes in Cortex jargon), and then pass them back into Cortex, using Cortex's pan_genome_matrix option - it will give you a big matrix showing you which unitigs are in which samples. Then you can make your own choices about what percentage of samples a contig needs to be in, to be considered "core". 90%? 95%? 100% etc

Roughly speaking, the command lines are --fastaq_index INDEX_SPECIFYING_SAMPLE_ID_AND_ASSEMBLY --kmer_size 21 --mem_height 21 --mem_width 100 --do_union no --auto_clean no --outdir DIR

This will make Cortex graph files of all the assemblies

Then, dump unitigs

ls DIR/binaries/unclean/31/*.ctx > list_of_binaries ls list_of_binaries > pool

cortex_var_31_c1 --kmer_size 21 --mem_height 21 --mem_width 100 --colour_list pool --output_supernodes unitigs.txt

This dumps unitigs as fasta file called unitigs.txt

And finally, dump the matrix which has first column =contig-id, second column = % of 21mers in contig in sample1, next column= % of 21mers in contig in sample 2.. etc, and rows are contigs.

cd DIR/binaries/unclean/31 for f in ls *.ctx; do echo pwd/$f > $f.filelist; done; cd ../../../.. ls DIR/binaries/unclean/31/*filelist > colourlist_of_samples

cortex_var_31_c100 --kmer_size 31 --mem_height 21 --mem_width 100 --colour_list colourist_of_samples --pan_genome_matrix unitigs.txt --max_read_len <max contig="" length="" in="" unitigs.txt="">

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by zam.iqbal.genome1.7k

There's a lot more detail on what all those options mean in the Cortex manual:

ADD REPLYlink written 6.0 years ago by zam.iqbal.genome1.7k

Thank you for your suggestion. I will have a look and let you know update.

ADD REPLYlink written 6.0 years ago by HG1.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1573 users visited in the last hour