Question: Missing sequence from a cosmid de novo assembly
Asked 23 months ago by kspata:

Hi All,

I performed de novo assembly for a cosmid sequenced on NextSeq PE 300 using SPAdes. The pipeline I used is as follows:

1. Trim the reads to remove low-quality bases.
2. Extract a subset of reads.
3. Perform SPAdes de novo assembly.

The expected length of the cosmid was 50 kb, but the assembled sequence is only about 47.5 kb. This cosmid contains a region that overlaps with another cosmid; the overlapping sequence was PCR-amplified and sequenced, confirming its presence.

The overlapping sequence is 990 bp long, and it is not present in the assembled sequence.

I have looked through the contigs.fasta file from the SPAdes output, and this sequence is not present in any of the other contigs either.

What approach should I use to search for this missing sequence in the raw data or the assembled data? How can I justify the absence of this sequence from the assembled genome?


Answer, 23 months ago, by harold.smith.tarheel (United States):

Two easily testable possibilities:

1) SPAdes failed to assemble the reads for this segment.
2) Reads for this segment are not present in your sample/data.

You can discriminate by aligning your data to the sequence in question.

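A rough way to tell the two possibilities apart without a full aligner is to count reads that share an exact k-mer with the 990 bp sequence on either strand. This is only a sketch and a crude stand-in for real alignment: exact k-mer matching misses reads with sequencing errors inside every k-mer, the function names and `k=31` are illustrative choices, and the FASTQ reader assumes standard four-line records.

```python
import gzip
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def target_kmers(target, k=31):
    """Map each k-mer of the target (both strands) to a forward-strand start position."""
    target = target.upper()
    index = {}
    for i in range(len(target) - k + 1):
        kmer = target[i:i + k]
        index.setdefault(kmer, i)
        index.setdefault(reverse_complement(kmer), i)
    return index


def read_sequences(fastq_path):
    """Yield read sequences from a FASTQ file (assumes 4 lines per record)."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip().upper()


def count_supporting_reads(reads, target, k=31):
    """Return (number of reads sharing a k-mer with target, per-position support)."""
    index = target_kmers(target, k)
    hits = 0
    position_support = Counter()
    for read in reads:
        positions = {
            index[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in index
        }
        if positions:
            hits += 1
            position_support.update(positions)
    return hits, position_support


# Hypothetical usage (file name and variable are placeholders):
# hits, support = count_supporting_reads(read_sequences("reads_R1.fastq.gz"), missing_990bp)
```

Zero supporting reads would point to possibility 2 (the segment is absent from the data); many supporting reads spread across the whole segment would point to possibility 1 (an assembly problem).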

Thank you for replying.

I did some further troubleshooting by searching for substrings of the missing sequence in the contigs FASTA file, but found no match for substrings of length 50 bp, 80 bp, or 100 bp.

  1. What other assembly tools or strategies can I use to troubleshoot this?
  2. Should I try merging the paired-end reads and assembling the merged data with SPAdes, treating them as single-end reads?
  3. Will sequencing using PacBio help? Can I use Canu/Pilon or any hybrid assembly approach to get the complete de novo assembled sequence of the cosmid (50 kb)?

Please advise.

(written 22 months ago by kspata)
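One pitfall with a plain substring search of contigs.fasta is strand: an assembler may report a contig in either orientation, so a forward-only search can miss a region that is present as its reverse complement. The sketch below checks fixed-size windows of the missing sequence on both strands. The function names, the window size, and the step are illustrative assumptions, and exact matching will still miss windows that contain a mismatch.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(path):
    """Yield (header, sequence) for each record in a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks).upper()
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def find_windows(contigs, target, window=50, step=10):
    """Search contigs for exact windows of the target on both strands.

    Returns (contig, strand, query_offset, contig_offset) tuples. For the
    "-" strand, query_offset is relative to the reverse-complemented target.
    """
    target = target.upper()
    queries = (("+", target), ("-", reverse_complement(target)))
    hits = []
    for name, seq in contigs:
        for strand, query in queries:
            for offset in range(0, len(query) - window + 1, step):
                pos = seq.find(query[offset:offset + window])
                if pos != -1:
                    hits.append((name, strand, offset, pos))
    return hits


# Hypothetical usage (variable name is a placeholder):
# hits = find_windows(read_fasta("contigs.fasta"), missing_990bp, window=50)
```

An empty result on both strands would agree with the sequence being absent from the assembly, which leaves the reads themselves as the next thing to check.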

Why would you search for the missing sequence in the assembled contigs, when you've already said that it's missing? I recommended aligning your data (i.e., your reads) to the missing sequence. Or, you can parse that data for substrings.

(written 22 months ago by harold.smith.tarheel)

It is present in the sample cosmid DNA, as confirmed by PCR and sequencing. I guess it was either not sequenced or SPAdes failed to assemble it. Mapping the forward and reverse reads to the missing DNA sequence gave a 0% mapping rate, which points to an Illumina sequencing failure rather than an assembly failure.

(written 22 months ago by kspata)