Question

How Do Researchers Choose A Reference Genome For A Novel Bacterial Strain Assembly?

4

13.2 years ago

tanjafiegel ▴ 40

Could someone please let me know how one makes a well-informed decision when choosing a reference genome to assemble a novel bacterial strain in the real world of bioinformatics?

Is it appropriate to assemble raw sequence data into contigs, then 'blastn' one of the larger contigs to find a similar strain and attempt reference genome assembly with that 'match'?

Is it then informative to find the ORFs with Glimmer3, or will the assembled consensus sequence actually be uninformative because it will contain parts of the reference genome?

What about the unassembled contigs that are left? What do people usually do with those? Chuck them in the recycling, or try to find some annotation for them?

Could I also ask whether people mostly run Glimmer3 on the finished consensus sequence or on the contigs assembled from the raw sequence reads?

Many thanks!

assembly • 5.2k views

updated 13.1 years ago by ALchEmiXt ★ 1.9k • written 13.2 years ago by tanjafiegel ▴ 40

score 4 · Answer 1 · 2012-05-01

4

13.2 years ago

Raquel Tobes ▴ 160

I think that, for bacterial genomes, de novo assembly is always better, since assembling against a reference genome inevitably biases the result toward that reference.

score 1 · Answer 2 · 2012-05-01

What we usually (as in not always) do is de novo assemble the genome using various settings depending on the sequencing technique used (e.g. k-mer size for Illumina data).
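
One way to compare the assemblies that come out of those different settings is a contiguity statistic such as N50. The answer does not say which metric is used to pick a winner, so this is only an illustrative sketch; the run names and contig lengths below are made up.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths from two runs with different k-mer sizes.
assemblies = {
    "k31": [100, 80, 50, 30, 20, 20],
    "k55": [150, 90, 40, 20],
}

# Pick the run with the highest N50 (an assumed selection criterion).
best = max(assemblies, key=lambda name: n50(assemblies[name]))
```

N50 on its own can reward misjoined contigs, so it is worth looking at total assembly length and contig count alongside it.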

If PE or mate-pair data is available, build scaffolds (de novo) from that.
If these are not available, we use the contigs by themselves:
- We scaffold the contigs based on a closely related strain or species (which is dangerous because it could differ from the new strain). The strain is either known from an expert biologist or can be identified by homology searching using a chronologically joined artificial chromosome. We BLAT or BLAST the contigs to a reference or use MUMmer tiling, lay out the contigs in order and orientation, and just add the non-mapped contigs at the end.
- We then link the laid-out contigs using artificial linkers that clearly separate the contigs but also contain start and stop codons in all six frames.
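
The layout-and-link step can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline described here: the `placements` table stands in for whatever BLAT, BLAST or MUMmer reports, the function names are hypothetical, and the linker is a commonly used sequence that I am assuming, not one given in this answer. It is its own reverse complement, with TAA stops and GTG starts in every frame.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
STOPS = {"TAA", "TAG", "TGA"}

# Assumed linker (not specified in the answer); N padding marks the gap.
LINKER = "NNNNNCACACACTTAATTAATTAAGTGTGTGNNNNN"


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def frames_with_stop(seq):
    """Return the set of (strand, frame) pairs that contain a stop codon."""
    frames = set()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - 2):
            if s[i:i + 3] in STOPS:
                frames.add((strand, i % 3))
    return frames


def build_pseudochromosome(contigs, placements, linker=LINKER):
    """Join contigs in reference order, unmapped contigs last.

    contigs:    dict of contig name -> sequence
    placements: dict of contig name -> (reference start, strand),
                only for contigs that mapped to the reference
    """
    mapped = sorted((n for n in contigs if n in placements),
                    key=lambda n: placements[n][0])
    unmapped = [n for n in contigs if n not in placements]
    ordered = []
    for name in mapped:
        seq = contigs[name]
        if placements[name][1] == "-":
            seq = reverse_complement(seq)  # flip to reference orientation
        ordered.append(seq)
    ordered.extend(contigs[n] for n in unmapped)
    return linker.join(ordered)


# Toy example: c2 maps first, c1 maps later on the minus strand, c3 is unmapped.
contigs = {"c1": "AAAC", "c2": "GGGT", "c3": "TTTT"}
placements = {"c2": (10, "+"), "c1": (500, "-")}
pseudo = build_pseudochromosome(contigs, placements)
```

Because the linker stops translation in all six frames, a gene caller run on the joined sequence should not predict an ORF that runs from one contig into the next.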
Predict ORFs using genemarkHMMp and Prodigal, and compare the results to identify erroneous or missed calls.
If possible, confirm these CDSs using RNA-seq experiments.
Do further annotation and strain comparisons...
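
Comparing the two sets of gene calls could look something like the sketch below. It assumes the predictions have already been parsed into (start, end, strand) records, and it matches calls by strand and stop position, which is one common convention because gene finders tend to agree on the stop and disagree on the start. The `Gene` type and `compare_calls` function are hypothetical names, not part of either tool.

```python
from collections import namedtuple

# Coordinates are 1-based with start < end, whatever the strand.
Gene = namedtuple("Gene", "start end strand")


def stop_key(gene):
    """Identify a call by strand and the coordinate of its stop codon end."""
    return (gene.strand, gene.end if gene.strand == "+" else gene.start)


def compare_calls(calls_a, calls_b):
    """Sort the calls of two predictors into agreement classes."""
    a = {stop_key(g): g for g in calls_a}
    b = {stop_key(g): g for g in calls_b}
    report = {"identical": [], "start_differs": [], "only_a": [], "only_b": []}
    for key in sorted(set(a) | set(b)):
        if key in a and key in b:
            bucket = "identical" if a[key] == b[key] else "start_differs"
            report[bucket].append((a[key], b[key]))
        elif key in a:
            report["only_a"].append(a[key])
        else:
            report["only_b"].append(b[key])
    return report


# Made-up calls from two predictors on the same sequence.
calls_a = [Gene(100, 400, "+"), Gene(500, 800, "-"), Gene(900, 1200, "+")]
calls_b = [Gene(100, 400, "+"), Gene(500, 770, "-"), Gene(1500, 1800, "+")]
report = compare_calls(calls_a, calls_b)
```

The `start_differs`, `only_a` and `only_b` buckets are the candidates for erroneous or missed calls that deserve a manual look or RNA-seq support.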

My 2ct.