Question: Plasmid Genome Assembly
3.3 years ago by
India
Sudhir Jadhao60 wrote:

Which is the best tool for plasmid assembly from Illumina TruSeq data?

I have used Velvet and SPAdes, but the results are not good.

Can anyone please suggest an assembler, or parameters for SPAdes and Velvet, that will give a good plasmid assembly?

modified 3.3 years ago by piet1.6k • written 3.3 years ago by Sudhir Jadhao60
3.3 years ago by
piet1.6k
planet earth
piet1.6k wrote:
> I am getting around 700 contig and 6 mb genome

6 Mb is the typical size of a whole bacterial genome. Bacterial genomes comprise a chromosome (usually only one) and often one or several plasmids. If you prepared the DNA from a single colony, you should get fewer than 100 contigs. 700 contigs indicate that either your DNA was not homogeneous (e.g. contaminated with a second strain of bacteria) or that the coverage of the chromosome is very low.

How did you prepare the DNA? Did you take any step to separate plasmid DNA from chromosomal DNA? Such separations are never 100% selective! My guess is that enough chromosomal DNA remained and was sequenced at low coverage. The chromosomal DNA is therefore dispersed over hundreds of contigs.

The FASTA file emitted by SPAdes reports the length and coverage of every contig in its header line:

>NODE_1_length_711720_cov_34.8955_ID_4768
>NODE_24_length_3121_cov_199.103_ID_4814

Please sort your contigs by coverage, then inspect the contigs with the highest coverage. They will presumably comprise plasmid sequences (or the highly redundant rRNA genes).
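The sorting step can be sketched in a few lines of Python. This is only a minimal illustration, assuming the headers follow the `NODE_<n>_length_<len>_cov_<cov>` pattern shown above; the function names and the `contigs.fasta` path in the usage note are hypothetical. Keep in mind that, as far as I know, the `cov` value SPAdes writes is a k-mer coverage, so it is best used for comparing contigs with each other rather than as an absolute read depth.

```python
import re

# SPAdes contig headers look like: >NODE_24_length_3121_cov_199.103_ID_4814
HEADER_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_header(line):
    """Return (name, length, coverage) for a SPAdes-style header, or None."""
    line = line.strip()
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return line.lstrip(">"), int(match.group(2)), float(match.group(3))


def contigs_by_coverage(header_lines):
    """Parse header lines and sort the contigs by coverage, highest first."""
    records = [rec for rec in map(parse_header, header_lines) if rec is not None]
    return sorted(records, key=lambda rec: rec[2], reverse=True)


def read_headers(fasta_path):
    """Collect the header lines (those starting with '>') of a FASTA file."""
    with open(fasta_path) as handle:
        return [line for line in handle if line.startswith(">")]


if __name__ == "__main__":
    # The two example headers from this post.
    demo = [
        ">NODE_1_length_711720_cov_34.8955_ID_4768",
        ">NODE_24_length_3121_cov_199.103_ID_4814",
    ]
    for name, length, cov in contigs_by_coverage(demo):
        print(f"{cov:10.2f}\t{length:8d}\t{name}")
```

For a real assembly you would call `contigs_by_coverage(read_headers("contigs.fasta"))` and look at the top of the list; short contigs with coverage far above the bulk are the plasmid (or rRNA) candidates.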

 

modified 3.3 years ago • written 3.3 years ago by piet1.6k
3.3 years ago by
Adrian Pelin2.2k
Canada
Adrian Pelin2.2k wrote:

The assemblers you tried should be able to do the job, provided that you have tried a reasonable number of assemblies with varying parameters. Whether you are assembling a plasmid or not makes little difference. Please provide more information about your dataset: for instance, the sequencing depth of the plasmid, the read length, whether the reads are paired, and whether anything else is being sequenced. It would also help to show the command lines you have already tried with Velvet and SPAdes, and to tell us why the results are not good.

I suspect the problem is the data, not the assembler. I am dealing with plasmid assembly myself and I notice problems with variable coverage. This probably has something to do with the biology of plasmid replication, since the variability is consistent between 2 different sequencing methods.

written 3.3 years ago by Adrian Pelin2.2k

Many thanks for the reply.

Sequencing depth of plasmid => 550X

Length of reads => 150

Paired or not paired => paired

Is there anything else that is being sequenced => no

Command lines:

1. Velvet =>
VelvetOptimiser.pl -t 50  --p Sample12 --d sample12 --a -o "-min_contig_lgth 200 -scaffolding yes"  -f '-fastq -shortPaired  R1_001_150.fastq R2_001_150.fastq'

AND

velveth 1_Output_velveth 69,73,2 -fastq -shortPaired -separate R1_001.fastq_filtered R2_001.fastq_filtered 

velvetg inputkmer -cov_cutoff auto -read_trkg yes -min_contig_lgth 200 -amos_file yes -ins_length auto  -exp_cov auto -ins_length_sd 50 -scaffolding yes

 

2. SPAdes =>

SPAdes-3.5.0-Linux/bin/spades.py -o SO_5216_BND11_S11_L001_1  -k 21,33,55,77 --careful --only-assembler -1 R1_001_150.fastq -2 R2_001_150.fastq -t 20

AND

spades.py -o S11_L001 -1 R1_001_150.fastq_filtered -2 R2_001_150.fastq_filtered -t 30 -k 41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81 --cov-cutoff auto

 

The output I am getting is a genome of around 6 Mb in ~700 scaffolds, which is too far from the expected result.

modified 3.3 years ago • written 3.3 years ago by Sudhir Jadhao60

My recommendations:

  • have a look at the k-mer distribution to check for sequencing bias and contamination, and to see whether the coverage is actually 550X. Expect a second, smaller peak at ~1100X, which represents inverted repeats of the plasmid
  • subsample to 100X (take only ~20% of your read data)
  • get the latest SPAdes (3.6.x) and run it with default settings, including error correction
  • map the reads to the contigs and remove anything with low coverage -> contamination
  • plasmids usually comprise one or more inverted repeats that are part of the replication mechanism. These often cannot be resolved properly by the assembler, but you can identify these regions/contigs by coverage as well: it should be double. You will probably have to copy and paste these contigs together by hand
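The subsampling step could look roughly like the Python sketch below. It is an illustration under stated assumptions, not a tested pipeline: it assumes plain-text (uncompressed) FASTQ with exactly four lines per record and with R1 and R2 in the same order, and the function names, the seed, and the output file names in the usage note are made up for the example. Each pair is kept with a fixed probability, so the two output files stay in sync.

```python
import random
from itertools import islice


def fraction_for_target(current_cov, target_cov):
    """Fraction of read pairs to keep to go from current_cov to target_cov."""
    return min(1.0, target_cov / current_cov)


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=42):
    """Keep each read pair with probability `fraction`; return pairs kept."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        # Walk both files in lockstep so mates are kept or dropped together.
        for rec1, rec2 in zip(fastq_records(f1), fastq_records(f2)):
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept
```

With the numbers from this thread, `fraction_for_target(550, 100)` gives about 0.18, and a call such as `subsample_pairs("R1_001_150.fastq", "R2_001_150.fastq", "R1_sub.fastq", "R2_sub.fastq", 0.18)` would write the reduced read set. Because the selection is random per pair, the resulting coverage is only approximately 100X.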
written 3.3 years ago by thackl2.6k

So there is something else being sequenced: the nuclear genome. If that genome is available, or if you can assemble it, then I would try to filter out the reads that map to the nuclear genome (provided that you do not have high-identity regions common to both the nuclear genome and the plasmid). It is very odd that you get only 2 contigs, that is, that you are able to assemble the nuclear genome so well, if not fully, and yet cannot assemble the plasmid.

written 3.3 years ago by Adrian Pelin2.2k

All of the suggestions given so far are good. I would add that you can try our tool Recycler. It takes into consideration some of the same features suggested here: coverage, circularity of sequences, and paired-end mapping. I posted more details here (and in the links therein): Recycler for plasmid assembly

written 3.0 years ago by Roye Rozov90
3.3 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I would generally recommend SPAdes as the best assembler for things like plasmids. But considering that you have tried it, what, specifically, is the problem? Do you get too many contigs, or does it not assemble at all?

written 3.3 years ago by Brian Bushnell16k

Yes, it is assembling. I am getting around 700 contigs and a 6 Mb genome, which is too far from our expectations.

written 3.3 years ago by Sudhir Jadhao60

I would say this is a common problem with today's "short"-read sequencing technology. A colleague of mine tried to sequence a short genome; he needed 7 years to fully complete it, and he eventually did it by using PacBio sequencing.

I think you need to use more than short reads. As in your case, you eventually discover that the assembly does not improve even though you increase the coverage of what you are sequencing.

Mate-pair reads will help you by giving better scaffolding of your contigs. If your plasmid is a commercial one, a comparison with trusted, similar plasmids using programs like Mauve will help a lot with ordering the contigs. You can also combine several kinds of sequences, such as regular Illumina, mate-pair, long Illumina reads and/or PacBio sequences. Otherwise, I think you will be facing a hard task.

written 3.3 years ago by Antonio R. Franco4.0k