Hi, we have sequenced a viral genome and assembled it with 454 Newbler. How can I tell whether the genome is circular or linear? Should this be a feature of the assembly software (there does not seem to be one), or should I use external software? Thanks a lot!
Almost added this as a comment to Pierre's answer...
Newbler does not report circularity, but it looks like you can find out whether a contig is circular from its output:
We assembled a bacterial genome using newbler (shotgun and paired end reads), and it showed a small plasmid. I checked the 454ReadStatus.txt file, and it showed a number of shotgun (!) reads that were aligned with the start in the first few hundred bases, and the end in the last few (orientation '-' and '+', respectively). We also found the reverse.
I guess you can take this as an indication for circularity.
A second option would be to 'make the contig circular' and cut the contig sequence in two at another position. Then map the reads back to your contig and look for reads mapping perfectly over the original 'breakpoint' (hope this makes sense).
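To make the 'cut at another position' idea concrete, here is a minimal sketch in plain Python. The function name `rotate_contig` and the choice of cut position are my own assumptions, not anything Newbler provides. It re-linearises the contig and reports where the original ends now sit, so you know which coordinate to inspect after remapping.

```python
def rotate_contig(seq: str, cut: int) -> tuple[str, int]:
    """Re-linearise a presumed circular contig at `cut` (0-based).

    Returns the rotated sequence and the index in the rotated sequence
    at which the original contig start now begins, i.e. the position of
    the original 'breakpoint' that reads should span if the contig is
    really circular.
    """
    if not 0 < cut < len(seq):
        raise ValueError("cut must fall strictly inside the contig")
    rotated = seq[cut:] + seq[:cut]
    junction = len(seq) - cut  # original end meets original start here
    return rotated, junction


if __name__ == "__main__":
    rotated, junction = rotate_contig("AAAACCCCGGGG", 4)
    print(rotated, junction)
```

Cutting roughly in the middle of the contig keeps the old breakpoint far from the new ends, so reads spanning it are not truncated by the aligner.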
If your assembler is not aware of circularity, it will probably split the genome arbitrarily at some point, in order to present a report as if it were linear. (I haven't had the opportunity to use Newbler, so I don't know how it treats its results.)
If the genome is circular, you should see some apparently badly oriented, but otherwise well mapped, read-pairs, pointing "away" from each other at the ends of the linear assembly. For each pair, the sum of the two reads' distances to their nearest contig ends should be similar to the expected insert size.
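A rough sketch of that check, assuming you have already pulled pair coordinates and strands out of your mapping output. The `MappedPair` record, the function names and the 25% tolerance are hypothetical, and I am assuming a library whose pairs normally face inwards ('+' on the left, '-' on the right); adjust the strand test if your library's convention differs.

```python
from typing import NamedTuple


class MappedPair(NamedTuple):
    left_end: int      # rightmost coordinate of the mate nearer the contig start
    left_strand: str   # '+' or '-'
    right_start: int   # leftmost coordinate of the mate nearer the contig end
    right_strand: str  # '+' or '-'


def wrapped_insert(pair: MappedPair, contig_len: int) -> int:
    """Insert size implied if the fragment runs off the contig end and
    continues from the contig start (i.e. across the junction)."""
    return pair.left_end + (contig_len - pair.right_start)


def pairs_supporting_circle(pairs, contig_len, expected_insert, tolerance=0.25):
    """Keep pairs that point away from each other and whose wrapped
    insert size is within `tolerance` of the expected insert size."""
    low = expected_insert * (1 - tolerance)
    high = expected_insert * (1 + tolerance)
    hits = []
    for p in pairs:
        points_away = p.left_strand == "-" and p.right_strand == "+"
        if points_away and low <= wrapped_insert(p, contig_len) <= high:
            hits.append(p)
    return hits
```

A handful of such pairs with consistent wrapped insert sizes is much more convincing than a single one, which could be a chimeric fragment.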
You may also find the "joining part" of the same circular sequence represented in a separate contig, so it's worth checking specifically for that.
As far as I am aware, no assembly programs produce explicitly labeled circular contigs on output, even though this would be useful for some viruses, many bacteria, plasmids, mitochondria, chloroplasts etc. In practice this is not usually an issue - you are unlikely to get enough nice data for a whole circular bacterial genome to come out as one contig. For those of us interested in viral genomes or mitochondria etc it is annoying, but the sequences are small enough to manually finish.
The other posters have suggested several ideas to help you manually stitch the ends together. I would add that I once had an apparently circular 40kb viral genome, assembled from 454 reads, come out as a linear contig of about 50kb - it had actually started repeating! Something else to check.
Finally, and probably most crucially - talk to some virologists! Some viruses can adopt a circular form for replication, but a linear form for bundling up into viral particles. In this situation you may have to do some lab work to work out where the ends really are.
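One quick way to look for a contig that has 'started repeating' is to test whether its end duplicates its start. This is only a sketch under a strong assumption: it looks for an exact match, so a single sequencing error in the repeated region will hide the overlap (a proper self-alignment would be more robust). The function name and `min_len` default are mine.

```python
def terminal_overlap(seq: str, min_len: int = 20) -> int:
    """Length of the longest exact match between the start and the end
    of `seq` (at most half the contig), or 0 if shorter than `min_len`."""
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq.endswith(seq[:k]):
            return k
    return 0


def trim_terminal_overlap(seq: str, min_len: int = 20) -> str:
    """Drop the redundant copy at the end, leaving one circle's worth."""
    k = terminal_overlap(seq, min_len)
    return seq[:-k] if k else seq
```

A long terminal overlap is itself a hint of circularity (or of terminal repeats, which is where the virologists come in).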
Very simple approach to check if a contig is circular:
- align all reads to the contig, pick only the best hit for each read
- concatenate the contig with itself, then redo the alignment
If some reads now align with a (much) better score, it is likely that the contig is circular.
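A sketch of the bookkeeping for this, assuming you run whatever aligner you like twice and collect the best score per read into two dictionaries. The function names, the score dictionaries and the `min_gain` threshold are all hypothetical; the scores are in whatever units your aligner reports.

```python
def double_contig(seq: str) -> str:
    """Concatenate the contig with itself so that the junction between
    its end and its start appears as ordinary internal sequence."""
    return seq + seq


def reads_improved_by_doubling(single_scores, doubled_scores, min_gain=10):
    """Return IDs of reads whose best score rose by at least `min_gain`
    when aligned to the doubled contig. Reads that were unaligned
    against the single contig are treated as having score 0."""
    improved = []
    for read_id, new_score in doubled_scores.items():
        old_score = single_scores.get(read_id, 0)
        if new_score - old_score >= min_gain:
            improved.append(read_id)
    return improved
```

Reads that sit well inside the contig will simply align twice with the same score, so only junction-spanning reads should show a gain.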
If your program produced a full linear sequence, I would simply look (a simple grep?) for some reads starting with the end of the assembly and ending with the beginning of the assembly.
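In the spirit of a simple grep, here is a small sketch that builds a probe from the last and first bases of the assembly and searches the reads for it on either strand. The function names and the `flank` length are my own choices, and it only finds exact matches, so reads with an error inside the probe will be missed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def junction_probe(contig: str, flank: int = 15) -> str:
    """End of the assembly followed by its beginning."""
    contig = contig.upper()
    return contig[-flank:] + contig[:flank]


def reads_spanning_junction(contig: str, reads: dict, flank: int = 15) -> list:
    """Names of reads containing the junction probe on either strand.

    `reads` maps read name -> read sequence.
    """
    probe = junction_probe(contig, flank)
    probe_rc = revcomp(probe)
    return [
        name
        for name, seq in reads.items()
        if probe in seq.upper() or probe_rc in seq.upper()
    ]
```

If the contig has a duplicated stretch at its ends (see the repeating-contig remark in another answer), trim that first, otherwise the probe will not correspond to a real junction.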
Here are my thoughts on this:
As you mentioned, you used shotgun-only 454 sequencing. Assuming that your viral genome is (almost) entirely sequenced:
There are no sequencing gaps, i.e. your genome did not contain segments where 454 sequencing failed and thus there are reads covering the entire genome. In this case, it would be best if you had only one contig the ends of which you could try to join by finding a read that spans the start and the end (taking orientations and complementarity into account). The more contigs you have the more impractical this approach gets and the less likely the assumption (no gaps) is.
There are 1 or 2 sequencing gaps. Find primers to try and close the gap with another sequencing method (Sanger comes to mind, as long reads are an advantage here). Again, take orientations into account to join contigs to a circular genome (or not). Again, with more contigs/gaps this gets impractical. You might want to tweak your assembler's options here.
The important thing to note is that it is not possible to distinguish between "no gaps-linear genome" and "one gap-circular genome". In order to be sure, I would try joining the ends together by either sequencing or simple PCR products.
Maybe a way too simple thought on this, but it could work:
As some answers mentioned earlier, the assembly will probably break at some point of low coverage. However, you could slightly change your input dataset (for instance, delete the reads within a segment of the genome assembly). A reassembly will then be pushed towards a defined break at the point where you deleted the reads, and you can check whether the ends of the first assembly are now joined or are still present as ends in the new assembly. Fiddling a little with which reads to delete might give you a good answer on circularity. The functional annotation could also give a clue as to whether the assembly was broken arbitrarily or has real (ragged) ends.
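Selecting which reads to drop could look something like the sketch below. It assumes you already have each read's mapped start and end on the first assembly (for example parsed from your assembler's or aligner's output); the function name and the dictionary layout are hypothetical. It removes reads lying entirely inside the chosen window, so the window should be longer than your longest read to leave a real hole in the coverage.

```python
def reads_to_keep(read_positions: dict, start: int, end: int) -> list:
    """Return IDs of reads that do NOT lie entirely within [start, end).

    `read_positions` maps read ID -> (mapped_start, mapped_end) on the
    first assembly, 0-based with an exclusive end.
    """
    return [
        read_id
        for read_id, (s, e) in read_positions.items()
        if not (start <= s and e <= end)
    ]
```

Write the kept reads to a new input file, reassemble, and compare where the new contig ends fall relative to the old ones.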
Geneious has a circular-capable assembler built in now: