Question

denovo-assembly with novel genes inserted

0

Entering edit mode

7.3 years ago

aindap ▴ 120

I have a viral strain with non-viral sequences inserted in. I aligned to my reads to a reference strain. The reads that didn't align, I did a denovo assembly with SSAKE to identify my contigs of my non-viral sequences.

What I want is a final assembly of my virus strain where I know the non-viral sequences inserted in. I'm new to assembly and am uncertain how to do this. I would appreciate any help!

Assembly denovo assembly Ilumina • 1.8k views

ADD COMMENT • link updated 7.3 years ago by Brian Bushnell 20k • written 7.3 years ago by aindap ▴ 120

score 1 · Answer 1 · 2017-02-01

I would assemble the whole dataset, shred your reference, and find the parts of your assembly that don't match the reference. For example, using BBMap, and assuming you have 2x150bp reads:

Preprocessing:

bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10

bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31

bbmerge.sh in=filtered.fq out=ecco.fq ecco mix strict adapters=default

tadpole.sh in=ecco.fq out=ecct.fq ecc

Is all that really necessary? Well, no, not if you can get a perfect assembly without it; but you usually can't. There's more preprocessing you can do as well, but for a virus, this is probably sufficient. Incidentally, all of the reference fastas mentioned are in /bbmap/resources/

Assembly:

tadpole.sh in=ecct.fq out=contigs.fa k=75

Use Tadpole or SSAKE or whatever gives the best assembly, but Tadpole usually works well for viruses. If your reads are longer than 150bp, use a longer kmer; half the read length is usually fine.

Post-assembly:

shred.sh in=reference.fa length=300 overlap=280 minlength=200 out=shreds.fa

bbmap.sh in=shreds.fa ref=contigs.fa out=mapped.sam bincov=bincov.txt covbinsize=10 ambig=all delcov=f

callvariants.sh in=mapped.sam out=vars.vcf ref=contigs.fa

bincov.txt now contains the locations of your assembly covered by the reference, so the non-covered locations are the inserts. Additionally, vars.vcf will contain the exact insert sequences and positions; they will be considered deletion events (since they are present in the "reference" [contigs.fa] but not the "reads" [shreds.fa]). Of course, some of them will be reverse-complemented since the assembled contigs have a random strand. Mapping the contigs to the original reference might give you the insertions as insert events with the original reference coordinates, so it's worth trying, but it's much easier to call deletions than insertions so that approach is more reliable.