Question: How Do Researchers Choose A Reference Genome For A Novel Bacterial Strain Assembly?
4
7.1 years ago by
tanjafiegel40
UK -- NCL
tanjafiegel40 wrote:

Could someone please let me know how one makes the best-informed decision when choosing a reference genome to assemble a novel bacterial strain in the real world of bioinformatics?

Is it appropriate to assemble raw sequence data into contigs, then 'blastn' one of the larger contigs to find a similar strain and attempt a reference-guided assembly with that 'match'?

Is it then informative to find the ORFs with Glimmer3, or will the assembled consensus sequence actually be uninformative because it will contain parts of the reference genome?

What about the unassembled contigs that are left? What do people usually do with those? Chuck them in the recycling, or try to find some annotation for them?

Could I also ask if people mostly run Glimmer3 on the finished consensus sequence or on the contigs assembled from the raw seq reads?

Many thanks!

assembly • 3.5k views
modified 7.1 years ago by ALchEmiXt1.9k • written 7.1 years ago by tanjafiegel40
4
7.1 years ago by
Raquel Tobes140
Spain
Raquel Tobes140 wrote:

I think that, for bacterial genomes, de novo assembly is always better, because assembling against a reference genome inevitably biases the result toward that reference.

written 7.1 years ago by Raquel Tobes140
1
7.1 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

What we usually (as in not always) do is assemble the genome de novo, using settings that depend on the sequencing technique used (e.g. k-mer size for Illumina data).

  • If PE or mate-pair data are available, build scaffolds (de novo) from them.
  • If these are not available, we use the contigs by themselves:

    • We scaffold the contigs based on a closely related strain or species (which is dangerous, because the new genome could differ from it). The related strain is either known from an expert biologist or can be identified by homology searching with an artificial chromosome made by joining the contigs one after another. We BLAT or BLAST the contigs against the reference, or use MUMmer tiling; lay out the contigs in order and orientation; and simply add the non-mapped contigs at the end.
    • We then link the laid-out contigs using artificial linkers that clearly separate the contigs and also contain start and stop codons in all six reading frames.
  • Predict ORFs using both genemarkHMMp and Prodigal, and compare the results to identify erroneous or missed calls.

  • If possible, confirm these CDS using RNA-seq experiments.
  • Do further annotation and strain comparisons...
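
The lay-out-and-link step above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not our actual pipeline: the linker sequence, the function names and the (contig, strand) layout format are all hypothetical, and the layout is assumed to have already been derived from the BLAT/BLAST/MUMmer hits.

```python
# Assumed example linker (hypothetical choice): Ns on both sides, TAA stop
# codons in every reading frame, and GTG alternative starts on both strands.
LINKER = "NNNNNCACACACTTAATTAATTAAGTGTGTGNNNNN"

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def frames_with_stop(seq):
    """Return the forward frames (0, 1, 2) of seq that contain a stop codon."""
    frames = set()
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            frames.add(i % 3)
    return frames


def has_six_frame_stops(linker):
    """True if the linker interrupts translation in all six reading frames."""
    forward = frames_with_stop(linker)
    reverse = frames_with_stop(reverse_complement(linker))
    return forward == {0, 1, 2} and reverse == {0, 1, 2}


def build_pseudochromosome(contigs, layout, linker=LINKER):
    """Join contigs into one artificial chromosome.

    contigs: dict mapping contig name -> sequence
    layout:  list of (name, strand) in reference order, strand is '+' or '-'
    Contigs missing from the layout (non-mapped) are appended at the end,
    in name order and in their original orientation.
    """
    pieces = []
    placed = set()
    for name, strand in layout:
        seq = contigs[name]
        pieces.append(seq if strand == "+" else reverse_complement(seq))
        placed.add(name)
    for name in sorted(contigs):
        if name not in placed:
            pieces.append(contigs[name])
    return linker.join(pieces)


if __name__ == "__main__":
    contigs = {"c1": "ATGAAA", "c2": "GGGTTT", "c3": "TTTT"}
    layout = [("c2", "-"), ("c1", "+")]  # c3 did not map to the reference
    print(has_six_frame_stops(LINKER))
    print(build_pseudochromosome(contigs, layout))
```

Because the linker stops translation in every frame, a gene caller run on the joined sequence should not be able to call an ORF that spans two neighbouring contigs, and the Ns make the joins easy to spot afterwards.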

My 2ct.

modified 7.1 years ago • written 7.1 years ago by ALchEmiXt1.9k