Question

Paired-end short reads assemble to multiple references with a plasmid also inside

4 months ago

sglass ▴ 30

Hello,

I am new to bioinformatics and am having trouble with an assembly and alignment. First, my sample data: I have Illumina MiSeq paired-end reads from a yeast strain. This strain carries the 16 yeast chromosomes plus an additional bacterial chromosome. Within the bacterial chromosome there should be a plasmid (without the plasmid my yeast would not have grown). I have references for the yeast, the bacterium, and the plasmid.

I am interested in finding out how my bacterial chromosome differs from my reference and the location of my plasmid on the bacterial chromosome.

I got paired-end reads (two files) from a sequencing service, already demultiplexed and with FastQC reports. I quality-trimmed my data using fastp (options -q 20 -u 50). I then assembled with spades.py (default parameters). This gave me the contigs and scaffolds produced by SPAdes.

I used minimap2 to assemble to a reference genome. I did this for my yeast and bacterial references to assemble my contigs. This is where I am getting lost in my assembly workflow.

I began searching for my plasmid in individual contigs using MMseqs2 easy-search. The search returned four contigs out of ~1200. I ran those four contigs through NCBI blastn, and the results showed that the plasmid was broken up among them.
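To see how the plasmid is split across those contigs, the search hits can be summarised per contig. The following is a minimal sketch, not a definitive method. It assumes the hits are in the default 12-column tab-separated output of MMseqs2 easy-search, that the plasmid was the query and the contigs were the target, and that the plasmid length is known. The function name `plasmid_coverage` and the contig names in the usage note are hypothetical.

```python
from collections import defaultdict


def plasmid_coverage(m8_lines, plasmid_length):
    """Summarise tabular hits of a plasmid (query) against contigs (target).

    Assumes 12 tab-separated columns per line:
    query, target, identity, alnlen, mismatch, gapopen,
    qstart, qend, tstart, tend, evalue, bits.
    Returns (fraction of plasmid covered, {contig: [(start, end), ...]}),
    with 1-based inclusive coordinates on the plasmid.
    """
    per_contig = defaultdict(list)
    intervals = []
    for line in m8_lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        contig = fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        start, end = min(qstart, qend), max(qstart, qend)
        per_contig[contig].append((start, end))
        intervals.append((start, end))

    # Merge overlapping intervals so shared regions are counted once.
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1

    return covered / plasmid_length, dict(per_contig)
```

Calling `plasmid_coverage(open("hits.m8"), plasmid_length)` (with a hypothetical file name) would show which stretch of the plasmid each contig carries and whether any part of the plasmid is missing from the assembly altogether.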

I am seeking guidance on how I can accomplish the goals stated above (finding how my bacterial chromosome differs from the reference, and where the plasmid sits on it). I am concerned that my plasmid can't be found because my SPAdes assembly was bad.

assembly synthetic illumina alignment • 502 views

ADD COMMENT • link 4 months ago by sglass ▴ 30

I used minimap2 to assemble to a reference genome. I did this for my yeast and bacterial references to assemble my contigs. This is where I am getting lost in my assembly workflow.

I can see why you are getting lost. Minimap2 is an aligner, not an assembler. It is also not the usual choice for short reads; it is designed primarily for long reads such as Nanopore.

Since you want to know how your sample differs from a reference, you need to align your reads to the reference (meaning the bacterial genome FASTA), then look at the relevant section in IGV. Since they are short reads, you should use a short-read aligner (BBMap, bwa-mem, bowtie2) rather than a long-read aligner. There is no reason to do assembly in this case; it will just make things confusing. If you want, you can split the reads between your bacteria and yeast, but generally that should not be necessary: both genomes are small and complex, so reads from one shouldn't align to the other. But if you expect your bacterial reads to have low identity to their reference, then splitting them first (via BBSplit or Seal) would be a good idea.
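As a rough illustration of the alignment step described above, here is a small sketch that only builds and prints the command lines rather than running them. It assumes bowtie2 and samtools are the chosen tools; the file names and the `mapping_commands` helper are hypothetical placeholders, and the thread count is arbitrary.

```python
import shlex


def mapping_commands(reference_fasta, reads_1, reads_2, prefix, threads=4):
    """Return the commands to index a reference, map paired-end reads,
    and produce a sorted, indexed BAM that can be opened in IGV."""
    index = f"{prefix}_index"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        # Build the bowtie2 index from the reference FASTA.
        ["bowtie2-build", reference_fasta, index],
        # Map the two paired-end read files and write SAM output.
        ["bowtie2", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Coordinate-sort and index the alignments for viewing.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
    ]


if __name__ == "__main__":
    # All file names here are made up for the example.
    for cmd in mapping_commands("bacteria.fasta", "trimmed_R1.fastq.gz",
                                "trimmed_R2.fastq.gz", "sample"):
        print(shlex.join(cmd))
```

The printed commands can be checked by eye and then run in a shell. The sorted BAM and its index are what IGV needs alongside the reference FASTA.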

ADD REPLY • link 4 months ago by Brian Bushnell 20k

Mapping the reads is helpful for determining how well my data aligns to the reference bacterial chromosome. However, after mapping in this way (using bowtie2), I now have separate read mappings for the plasmid and the bacterial chromosome. This is not what I want.

Based on the read mapping, I know that my plasmid is not intact in the bacterial genome. The essential working parts of the plasmid have been detected in the reads.

How can I assemble the bacterial chromosome from my reads? Do I need to do de novo assembly (I have contigs from a SPAdes run) and work from there? The mapping does not help me because I am primarily interested in where the genes of interest from the plasmid have inserted into the bacterial chromosome. I need an annotated sequence file. Are there programs or packages that can help me with this?
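One possible way to narrow down an insertion site from mapped reads, offered as a sketch and not an established answer from this thread: map the pairs once against a single FASTA containing both the bacterial chromosome and the plasmid, then look for pairs where one mate lands on the chromosome and the other on the plasmid. Chromosome positions where such pairs pile up are candidate junctions. The code below assumes uncompressed SAM text as input; the reference names, bin size, MAPQ cutoff, and the `junction_candidates` helper are all hypothetical choices.

```python
from collections import Counter


def junction_candidates(sam_lines, plasmid_name, chromosome_name,
                        bin_size=500, min_mapq=20):
    """Count chromosome-mapped reads whose mate maps to the plasmid.

    Returns [(bin_start, count), ...] sorted by count, where bin_start
    is the 1-based start of a bin_size window on the chromosome.
    """
    bins = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname = fields[2]
        pos = int(fields[3])
        mapq = int(fields[4])
        rnext = fields[6]
        if flag & 0x4 or flag & 0x8:  # read or its mate is unmapped
            continue
        if flag & 0x900:  # secondary or supplementary alignment
            continue
        if mapq < min_mapq:  # ambiguous placement
            continue
        # RNEXT is "=" when the mate is on the same reference, so a
        # literal plasmid name here means the pair spans both sequences.
        if rname == chromosome_name and rnext == plasmid_name:
            bins[(pos - 1) // bin_size * bin_size + 1] += 1
    return bins.most_common()
```

Windows with many such pairs can then be inspected in IGV. This only points at candidate regions; confirming the exact breakpoint would still need the soft-clipped or split reads at that location, or a contig that spans the junction.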

ADD REPLY • link 4 months ago by sglass ▴ 30