Assembly with repetitive regions
2.6 years ago
msrch • 0

Hello all!

I am assembling a synthetic genome sequenced with Oxford Nanopore. The problem is that the assembly is highly repetitive: every contig looks like ATATATATATATATATAT. I do not understand why, because when I run the same process on the Acinetobacter pittii genome, the result looks normal and is similar to the reference.

I am new to Oxford Nanopore and assemblies, and although I have read the papers, I cannot understand why this is happening. Is it because the reads only overlap in repetitive regions, so the consensus can only use these regions to build the assembly?

Thank you in advance for any help

Flye Assembly Canu SPAdes • 1.2k views

Could you specify what "synthetic genome" means in your case?


By synthetic genome I mean a collection of synthetic DNA sequences used for data storage. The reference consists of 42,000 sequences, each 120 bp long. The dataset is taken from this publication: https://www.researchsquare.com/article/rs-27205/v1 and this GitHub repository: https://github.com/helixworks-technologies/dos

It is the 3xr6 dataset in the repository.

2.6 years ago
Michael 54k

You cannot - and do not need to - "assemble" these artificial sequences using a genome assembler, because they violate the assumptions that genome assembly relies on. The basic assumption of assembly is that the sequenced fragments are partial (or even complete), randomly distributed sub-sequences of a set of larger distinct entities: replicons (e.g. chromosomes). Identical stretches of sequence shared between fragments (overlaps) either come from the same location on the same replicon, and can therefore be used to stitch the original replicon back together, or result from sequence duplications or repeats. Here, in contrast:

  • There is no larger "genome"; all sequences are artificial constructs.

  • All sequences are shorter than the read length of the sequencing machine and can therefore be recovered in full from single reads. Additional coverage can - and should - be used to error-correct each fragment by consensus.

  • Identical stretches shared between sequences are artifacts of the design and say nothing about a common origin on a replicon:

All sequences in the 3xr6 oligo pool contain the same forward and reverse priming regions for PCR-compatibility.

These priming regions will most likely be removed completely, because identical regions at such high coverage can be interpreted as adapter contamination. Further, once those "adapters" are trimmed, the manuscript states that the remaining sequences are all unique (orthogonal, if I understood that correctly), so they provide no further overlap or consensus information between different sequences.

Thus, whatever genome assembly method is used on the data, the result is moot because the input is not a genome.
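
As a rough illustration of using coverage for per-sequence consensus instead of assembly, here is a minimal Python sketch. It assumes the reads covering one designed sequence have already been grouped and aligned to a common length, with `-` marking gaps; that grouping and alignment step matters a lot for Nanopore data and is not shown. The function name `majority_consensus` is made up for this example.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over pre-aligned reads of equal length.

    Assumption: reads are already aligned ('-' marks a gap). Columns where
    the gap wins the vote are dropped from the consensus.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column (ties go to the first seen).
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["AC-GT", "ACTGT", "AC-GT"]  # toy data, not from the 3xr6 pool
    print(majority_consensus(reads))
```

With the toy reads above, the single inserted `T` is outvoted by the two gap symbols and the column is dropped. A dedicated polishing or multiple-alignment tool would normally replace this naive vote.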


I understand. I had assumed that, because the reads are made of concatenated orthogonal sequences, the same concatenation might appear in multiple reads. However, as you said, each sequence can be recovered in full. So what I should have done is simply compute the consensus of the reads covering the same sequence and trim the adapters, am I right?

Thank you for the quick reply


If your intent is to recover the stored "information", then yes. You might also want to check the bit error rate by comparing each consensus to the reference sequence; that could tell you whether error detection methods like CRC are required. Whether or not to remove the adapters is up to you.
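
A minimal Python sketch of those two optional steps, under several assumptions: the 2-bits-per-base mapping (A=00, C=01, G=10, T=11) is only one possible encoding and may not match the one used in the publication; the consensus is assumed to have the same length as the reference (no indels left); and the reverse primer is assumed to be given in the orientation in which it appears in the read. All function names are hypothetical, and any primer sequences you pass in must come from the dataset itself.

```python
# Assumed 2-bit encoding per base; the real data-storage encoding may differ.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def find_approx(seq, motif, max_mismatches=2):
    """Start of the window with the fewest mismatches to motif, or None."""
    best_pos, best_mm = None, max_mismatches + 1
    for i in range(len(seq) - len(motif) + 1):
        mm = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mm < best_mm:  # strict '<' keeps the leftmost window on ties
            best_pos, best_mm = i, mm
    return best_pos


def trim_primers(seq, fwd, rev, max_mismatches=2):
    """Return the payload between the two priming regions, or None."""
    start = find_approx(seq, fwd, max_mismatches)
    if start is None:
        return None
    rest = seq[start + len(fwd):]
    end = find_approx(rest, rev, max_mismatches)
    if end is None:
        return None
    return rest[:end]


def bit_error_rate(consensus, reference):
    """Fraction of wrong bits under the assumed 2-bit-per-base mapping."""
    if not reference or len(consensus) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    wrong_bits = sum(
        bin(BASE_BITS[a] ^ BASE_BITS[b]).count("1")
        for a, b in zip(consensus, reference)
    )
    return wrong_bits / (2 * len(reference))
```

For example, `bit_error_rate("ACGT", "ACGG")` gives 0.125 under this mapping (one wrong bit out of eight). The mismatch-only search does not tolerate indels inside the priming regions, so for raw Nanopore reads an alignment-based search would likely be more robust.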


Thank you very much for the answers!
