Question: Thousands of contigs in E. coli assembly
9 months ago, vitorgomesbio (0) wrote:

Hi!

I recently started studying bioinformatics and need help. I assembled and annotated eleven bacterial genomes. After the annotation, I found that each assembly consisted of thousands of contigs. When I ran BLASTn, I realized that each contig produced many different alignments. How can I identify the correct strain of my bacteria? Should I select a single contig, or is there a tool to merge them into one sequence?

prokka contigs assembly genome • 487 views

Assuming your assembly is valid, the top hit for each contig should give you a good idea of which genome the sequence belongs to (at genus level for sure, perhaps deeper). If you have thousands of contigs for each of 11 genomes, then you probably don't have good assemblies. I suggest that you check them with Quast.
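As a rough illustration of the kind of numbers an assembly report gives you, here is a minimal Python sketch that counts contigs and computes N50 from FASTA text. It is not how Quast is implemented, and the function names (`read_contig_lengths`, `n50`, `summarize`) are made up for this example.

```python
from typing import Dict, Iterable, List


def read_contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect a few basic assembly statistics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

To use it, pass an open file handle for your contigs FASTA to `read_contig_lengths` and the result to `summarize`. A contig count in the thousands together with a small N50 is the pattern that points to a fragmented assembly.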

modified 9 months ago • written 9 months ago by genomax (68k)

Thank you!

I trimmed my sequences using the Trim Galore tool. Each of the eleven genomes still had thousands of contigs, even after trimming. Does that mean something is going wrong?

written 9 months ago by vitorgomesbio (0)

For example, the first contig of one of my genomes resulted in several alignments. The first one was this:

CP003295.1 Streptococcus infantarius subsp. infantarius CJ18, complete genome
Max score: 48348 | Total score: 72062 | Query cover: 99% | E-value: 0.0 | Identity: 96%

Can I infer that this is my strain?

modified 9 months ago • written 9 months ago by vitorgomesbio (0)

"Can I infer that this is my strain?"

If the majority of the contigs for that one sample consistently show hits to Streptococcus infantarius, then that can be a reasonable conclusion. You would want to use a tool like Mauve to see how your contigs align to the reference (if one is available) and how many gaps you still have in your sequence.
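One way to check whether the contigs "consistently" hit the same organism is to keep the best hit per contig and count the species. Below is a minimal Python sketch. It assumes you exported the BLAST results as tab-separated text with exactly three columns (contig ID, subject title, bit score), for example with `-outfmt "6 qseqid stitle bitscore"`. The function names are made up, and taking the first two words of the title as the species is a crude shortcut.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def best_hit_per_contig(rows: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Keep the highest-bit-score hit for each contig.

    Each row is assumed to be 'contig<TAB>subject title<TAB>bitscore'.
    Rows that do not have exactly three fields are skipped.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        contig, title, bitscore = fields
        score = float(bitscore)
        if contig not in best or score > best[contig][1]:
            best[contig] = (title, score)
    return best


def species_of(title: str) -> str:
    """Crude genus + species label: the first two words of the subject title."""
    return " ".join(title.split()[:2])


def tally_species(rows: Iterable[str]) -> Counter:
    """Count how many contigs have their best hit in each species."""
    best = best_hit_per_contig(rows)
    return Counter(species_of(title) for title, _score in best.values())
```

Calling `tally_species(open_handle).most_common(5)` on each sample's results shows whether one species dominates or whether the best hits are spread across several organisms, which would be worth a closer look.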

"I trimmed my sequences using the Trim Galore tool. Each one of the eleven genomes presented thousands of contigs even after trimming. Is there any problem going on then?"

Trimming sequences is only the first step towards assembly. If you are not getting reasonable assemblies, there are several possibilities. Your libraries may be non-comprehensive or under-represented. You may also have too much sequence coverage (it may sound odd, but very deep coverage also leads to problematic assemblies); in that case you will have to down-sample your data before assembling. Can you tell us whether one or the other is the case here?

Were these strains sequenced/assembled independently?
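On the coverage point above, the arithmetic is simple enough to sketch in Python. Coverage is roughly the total number of sequenced bases divided by the genome size, and the fraction of reads to keep is the target coverage divided by the current coverage. The function names, the target coverage and the genome size are all values you would supply yourself; nothing here is specific to any assembler.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def estimated_coverage(total_read_bases: int, genome_size: int) -> float:
    """Approximate depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def keep_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep to reach the target; 1.0 if already at or below it."""
    if current_coverage <= target_coverage:
        return 1.0
    return target_coverage / current_coverage


def downsample(reads: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Randomly keep round(len(reads) * fraction) items, preserving their order.

    For paired-end data, pass a sequence of (read1, read2) pairs so that
    mates are kept or dropped together.
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    n_keep = round(len(reads) * fraction)
    kept_indices = sorted(rng.sample(range(len(reads)), n_keep))
    return [reads[i] for i in kept_indices]
```

For instance, 1,000,000,000 bases of reads over a 5,000,000 bp genome is about 200x; if you decided to aim for 100x (an arbitrary illustrative target), `keep_fraction(200, 100)` says to keep half of the read pairs.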

modified 9 months ago • written 9 months ago by genomax (68k)

Are the eleven genomes from eleven isolated cultures? You may also have contamination. Check out BlobTools; it is a useful tool both for helping to identify the species and for detecting possible contaminants.
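As a very crude first look before running a dedicated tool, you can compare the GC content of the contigs, since contigs from a different organism often (not always) differ in GC. The Python sketch below is only an illustration of that idea, not a description of how BlobTools works. The function names and the 0.10 deviation threshold are arbitrary choices for this example.

```python
from statistics import median
from typing import Dict, List


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # nothing but ambiguous bases (or an empty sequence)
    return (seq.count("G") + seq.count("C")) / acgt


def gc_outliers(contigs: Dict[str, str], max_deviation: float = 0.10) -> List[str]:
    """Names of contigs whose GC fraction is far from the median of all contigs.

    max_deviation is an arbitrary illustrative threshold, not a recommended value.
    """
    gc = {name: gc_fraction(seq) for name, seq in contigs.items()}
    if not gc:
        return []
    centre = median(gc.values())
    return [name for name, value in gc.items() if abs(value - centre) > max_deviation]
```

Contigs flagged this way are only candidates for a closer look (for example, checking their BLAST hits); short contigs in particular can have unusual GC content without being contaminants.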

written 9 months ago by h.mon (25k)

If you've got that many contigs, it suggests your sequencing quality and/or assembly wasn't good to start with. It sounds a lot like contamination. Proceed with caution if you plan to do more with this data. Even just annotating is probably more than poor data justifies.

written 9 months ago by jrj.healey (12k)