Question: denovo-assembly with novel genes inserted
gravatar for aindap
3.3 years ago by
United States
aindap110 wrote:

I have a viral strain with non-viral sequences inserted in. I aligned to my reads to a reference strain. The reads that didn't align, I did a denovo assembly with SSAKE to identify my contigs of my non-viral sequences.

What I want is a final assembly of my virus strain where I know the non-viral sequences inserted in. I'm new to assembly and am uncertain how to do this. I would appreciate any help!

ADD COMMENTlink modified 3.3 years ago by Brian Bushnell17k • written 3.3 years ago by aindap110
gravatar for Brian Bushnell
3.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

I would assemble the whole dataset, shred your reference, and find the parts of your assembly that don't match the reference. For example, using BBMap, and assuming you have 2x150bp reads:

Preprocessing: in=r1.fq in2=r2.fq out=trimmed.fq minlen=90 ktrim=r k=23 mink=11 hdist=1 tbo tpe ref=adapters.fa maxns=0 qtrim=r trimq=10 in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz,sequencing_artifacts.fa.gz k=31 in=filtered.fq out=ecco.fq ecco mix strict adapters=default in=ecco.fq out=ecct.fq ecc

Is all that really necessary? Well, no, not if you can get a perfect assembly without it; but you usually can't. There's more preprocessing you can do as well, but for a virus, this is probably sufficient. Incidentally, all of the reference fastas mentioned are in /bbmap/resources/

Assembly: in=ecct.fq out=contigs.fa k=75

Use Tadpole or SSAKE or whatever gives the best assembly, but Tadpole usually works well for viruses. If your reads are longer than 150bp, use a longer kmer; half the read length is usually fine.

Post-assembly: in=reference.fa length=300 overlap=280 minlength=200 out=shreds.fa in=shreds.fa ref=contigs.fa out=mapped.sam bincov=bincov.txt covbinsize=10 ambig=all delcov=f in=mapped.sam out=vars.vcf ref=contigs.fa

bincov.txt now contains the locations of your assembly covered by the reference, so the non-covered locations are the inserts. Additionally, vars.vcf will contain the exact insert sequences and positions; they will be considered deletion events (since they are present in the "reference" [contigs.fa] but not the "reads" [shreds.fa]). Of course, some of them will be reverse-complemented since the assembled contigs have a random strand. Mapping the contigs to the original reference might give you the insertions as insert events with the original reference coordinates, so it's worth trying, but it's much easier to call deletions than insertions so that approach is more reliable.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Brian Bushnell17k

Thank you so much!!! tadpole worked much faster and I got longer contigs

The bbmap step helped me find the insert point of my novel genes in the virus

I'm new to assembly, but this pipeline was just what I was looking for.

ADD REPLYlink written 3.3 years ago by aindap110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1835 users visited in the last hour