Question

How to whole genome assemble when I have three references

2

19 months ago

sglass ▴ 30

Hello, I am new to bioinformatics and am having trouble with an assembly. I have Illumina MiSeq short paired-end reads from an organism that has multiple yeast chromosomes and a single bacterial chromosome with portions of a plasmid inside. I have references for all three organisms.

I want to know how I can get the bacterial chromosome assembled entirely. I need it assembled entirely and then to annotate it, because my PI is not versed enough in sequencing information to receive any other format.

Is there an open source package that will allow me to assemble to a reference, when I expect it to not be exactly the same? Or is there a combination of packages I can use to do this? I have contigs and scaffolds from a SPAdes assembly. I am not sure where to go from there.

miseq plasmid assembly illumina beginner • 1.8k views

ADD COMMENT • link updated 18 months ago by d-cameron ★ 3.0k • written 19 months ago by sglass ▴ 30

0

What is your final goal? Is it genome assembly or something else? This question feels like an XY problem where you're asking for help with genome assembly when what you really need is something else. It's possible that there is a better way to answer your actual scientific question that doesn't involve genome assembly.

when I expect it to not be exactly the same

For example, if a similar/ancestral bacterium has already been sequenced and annotated, you could perform variant calling against it to find the differences instead of doing genome assembly.

ADD REPLY • link 19 months ago by d-cameron ★ 3.0k

1

The absolute goal with this data is to determine the sequence of the Yeast Artificial Chromosome of the sample. The yeast artificial chromosome should be the bacterial chromosome with the genes of interest from the plasmid. That is why I have been focused on assembling the data. I was able to align the data to my references, however I was unable to create an alignment that showed how my plasmid had inserted into the bacterial chromosome. I am having trouble moving forward from here, but am going to give variant calling a try.

We had asked our sequencing facility for long reads, and were advised to begin with short reads. We were advised to pursue a hybrid assembly if short reads did not answer the question.

It is not essential for me to do whole genome assembly; my PI just wants the information regarding SNPs and to know where the genes from the plasmid have inserted into the chromosome. Is this something I can learn from doing variant calling from my alignment? I have alignments from Bowtie2 and CLC, aligned separately to my bacterial chromosome, yeast chromosome and plasmid. I want my plasmid alignment and bacterial chromosome alignment to be combined.

This is my first experience with BioStars, so I did not post my question with enough detail. Is there any information I can provide to resolve the XY issue?

ADD REPLY • link 19 months ago by sglass ▴ 30

1

just wants the information regarding SNPs and to know where the genes from the plasmid have inserted into the chromosome. Is this something I can learn from doing variant calling from my alignment?

If you already have a reference genome for the ancestral (i.e. pre-plasmid) strain, then yes, you can get away with variant calling. When I did this sort of analysis I ran an SNV caller, an SV caller, and a CNV caller. I created a reference genome containing the ancestral strain + plasmid, aligned the reads with bwa mem, used GRIDSS to detect the breakpoints, then manually inspected and QCed the detected breakpoints with IGV (to ensure that a) the plasmid was inserted as intended, and b) there were no off-target structural changes). CNV calling was used to verify the number of copies of the plasmid inserted (e.g. in case I had 3.5 tandemly duplicated copies of the plasmid inserted, so that while it looked like one of the plasmid genes was only partially inserted, there were additional full-length copies in there as well) and that there were no losses elsewhere. Ideally, you sequenced the yeast before and after plasmid insertion so you can tell whether a difference (e.g. an SNV) between the reference and your sample was already present in your unmodified strain.
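The combined-reference step described above could look roughly like the following. This is only a sketch under assumptions: the function names, file names and the 60-column line width are made up for illustration, and the bwa commands are returned as argument lists rather than executed, so you can inspect them before running anything.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the sequence name.
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def build_combined_reference(fasta_paths, out_path, width=60):
    """Concatenate several references into one FASTA and return the names."""
    names = []
    with open(out_path, "w") as out:
        for path in fasta_paths:
            for name, seq in read_fasta(path):
                # Duplicate names would make the alignments ambiguous.
                if name in names:
                    raise ValueError(f"duplicate sequence name: {name}")
                names.append(name)
                out.write(f">{name}\n")
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
    return names


def alignment_commands(reference, reads_1, reads_2, threads=4):
    """Return bwa commands as argument lists (hypothetical helper, not executed).

    bwa mem writes SAM to standard output, so redirect it to a file
    or pipe it into a sorter when you actually run it.
    """
    return [
        ["bwa", "index", reference],
        ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2],
    ]
```

Aligning to one combined FASTA (rather than to each reference separately) is what lets read pairs span the chromosome/plasmid boundary, which is the signal a breakpoint caller needs.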

Alternatively, there may be tools & workflows that are targeted to your specific application. A quick literature search found this paper that looks very close to what you want to do (but does require short+long reads).

ADD REPLY • link 18 months ago by d-cameron ★ 3.0k

score 1 · Answer 1 · 2023-12-19

1

19 months ago

colindaven 7.7k

Short reads are not long enough to do a perfect de novo assembly. You can likely only get as far as contigs or scaffolds.

You could add Hi-C data but it is probably overkill. I would suggest resequencing using PacBio or Oxford Nanopore (even a Flongle flow cell for ca. 100 euros might help a lot) on your friendly neighbouring lab's MinION or at a core facility.

You would use the short reads for assembly polishing downstream.

If you can't do that you can use something like JBrowse2 or CLC to check the synteny of your current assembly - if contigs are long enough - but this will always be an incomplete and likely incorrect answer.
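To judge whether the contigs are "long enough", one common summary is the N50 of the assembly. A minimal sketch, assuming a plain FASTA of contigs such as the contigs.fasta that SPAdes writes; the function names are just for illustration.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that contigs >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

If the N50 is far below the size of the bacterial chromosome, the assembly is fragmented and any synteny view will show many small pieces rather than one chromosome.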

Long read assemblies have been standard for many years now, so it is a shame your PI thought short reads were a great idea for a complex problem.

Edit - if you have to stick with the contigs you can play with software like RagTag to scaffold the contigs using a related reference genome. https://github.com/malonge/RagTag

ADD COMMENT • link 19 months ago by colindaven 7.7k

0

Short reads may not necessarily be a problem for the hypothesis the experiment is trying to test, but they're definitely not good for assembly. For example, this could be a differential gene expression experiment with RNA-seq and the PI is just as clueless and just asked for a fully annotated reference genome because the usual DE pipelines need that reference data to run.

OP hasn't actually told us what their actual problem is, just that their PI has tasked them with this because their "PI is not versed enough in sequencing".

ADD REPLY • link 19 months ago by d-cameron ★ 3.0k

0

I don't know - OP did mention genome assembly 3 times and included the tag assembly; I think that's sufficient. In the real world, unfortunately, lots of PIs dump poor or insufficient datasets on their students, and ask them to obtain all the answers with 100 per cent confidence. I hope it's not an RNA-seq question since, if so, I have been completely misled :-)

ADD REPLY • link 19 months ago by colindaven 7.7k

1

Genome assembly is definitely the task OP was asked by the PI to do, but I suspect getting an assembly isn't the end-point for the project. It looks like an endosymbiosis project so maybe the PI wants to know how their constructed organism differs from the yeast+bacteria it was made from. Or maybe they want to know where the plasmid was inserted. Both of these end goals can be achieved using assembly but neither requires assembly. You could do SNP+SV+CNV calling against the ancestral strain for the former and even get away with just SV calling for the latter.
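To illustrate why SV calling alone can locate an insertion: if the reads are aligned to a single reference containing both the chromosome and the plasmid, read pairs that straddle a junction have one mate on each sequence. The sketch below counts such pairs from SAM text lines. It is a toy illustration of the signal, not a substitute for a real SV caller (which also uses split reads and other evidence), and the function name, bin size and MAPQ cutoff are assumptions.

```python
from collections import Counter


def junction_candidates(sam_lines, chrom_name, plasmid_name,
                        bin_size=500, min_mapq=20):
    """Count chromosome reads whose mate maps to the plasmid, per position bin.

    Returns (bin_start, count) tuples, most supported bin first.
    bin_start is a 1-based chromosome coordinate.
    """
    bins = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname = fields[2]
        pos = int(fields[3])
        mapq = int(fields[4])
        rnext = fields[6]  # "=" means the mate is on the same sequence
        if flag & 0x4 or flag & 0x8:  # read or mate unmapped
            continue
        if flag & 0x100 or flag & 0x800:  # secondary or supplementary
            continue
        if mapq < min_mapq:
            continue
        if rname == chrom_name and rnext == plasmid_name:
            bin_start = (pos - 1) // bin_size * bin_size + 1
            bins[bin_start] += 1
    return bins.most_common()
```

Bins with many such pairs are places worth opening in IGV; a handful of isolated pairs is more likely to be chimeric library noise than a real junction.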

It would be good if OP told us what they were actually trying to achieve. This question feels very much like an XY problem.

ADD REPLY • link 19 months ago by d-cameron ★ 3.0k