3.3 years ago by
Walnut Creek, USA
I would assemble the whole dataset, shred your reference, and find the parts of your assembly that don't match the reference. For example, using BBMap, and assuming you have 2x150bp reads:
bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10
bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31
bbmerge.sh in=filtered.fq out=ecco.fq ecco mix strict adapters=default
tadpole.sh in=ecco.fq out=ecct.fq ecc
Is all that really necessary? Well, no, not if you can get a perfect assembly without it; but you usually can't. There's more preprocessing you can do as well, but for a virus, this is probably sufficient. Incidentally, all of the reference fastas mentioned are in /bbmap/resources/
tadpole.sh in=ecct.fq out=contigs.fa k=75
Use Tadpole or SSAKE or whatever gives the best assembly, but Tadpole usually works well for viruses. If your reads are longer than 150bp, use a longer kmer; half the read length is usually fine.
shred.sh in=reference.fa length=300 overlap=280 minlength=200 out=shreds.fa
bbmap.sh in=shreds.fa ref=contigs.fa out=mapped.sam bincov=bincov.txt covbinsize=10 ambig=all delcov=f
callvariants.sh in=mapped.sam out=vars.vcf ref=contigs.fa
bincov.txt now contains the locations of your assembly covered by the reference, so the non-covered locations are the inserts. Additionally, vars.vcf will contain the exact insert sequences and positions; they will be considered deletion events (since they are present in the "reference" [contigs.fa] but not the "reads" [shreds.fa]). Of course, some of them will be reverse-complemented since the assembled contigs have a random strand. Mapping the contigs to the original reference might give you the insertions as insert events with the original reference coordinates, so it's worth trying, but it's much easier to call deletions than insertions so that approach is more reliable.