Question

RNA-Seq analysis for (bacterial) substrains that differ from the reference genome

0

Entering edit mode

5.1 years ago

g • 0

I've got an RNA-Seq project that got more complicated when it became apparent the reference genome was more different from my strains than anticipated. I'm trying to figure out the best (or least bad) sequence to map my reads against, and could use some help, since I am probably overthinking things at this point.

Things I have:

Two bacterial strains (A and B). B was generated by random chemical mutagenesis of A (number & type of mutations are unknown.) They turn out to be different enough from the type strain C that they may constitute a distinct substrain.
Timecourse RNA-Seq data for both A & B (single, not paired end). Multiple timepoints per biological sample, several biological replicates.
Closed, annotated reference genome for the type strain C.
Incomplete draft genomes for both A & B (Illumina sequencing). However, they're split across too many contigs. There are some low-complexity regions, near-identical gene copies, and other features that complicate assembly, along with some reads from a difficult-to-eradicate contaminant strain. Multiple assembly methods still leave them a bit of a mess.
Simple variant calling for A & B against C using the reads obtained for the A & B draft genomes.
It will be hard to convince other people on this project to spend the money to close or significantly improve the A & B drafts via PacBio or something unless it really is necessary. We didn't originally budget for genome sequencing.

Data I want to be able to assemble:

Differential gene expression within one substrain (A-only or B-only) across timepoints or between A & B at one timepoint (nothing fancy - pairwise DESeq2 comparisons and I may also try some timecourse-focused tools).
There are genes of particular interest that are directly affected by assembly issues in the current A & B draft genomes. I nevertheless want to make sure that my differential gene expression analyses for them are as decent as possible.
I want to be reasonably sure about any differences in gene sequences for A & B vs. C, for interpreting paired proteomics data.

Concerns I have about about mapping my A/B RNA-Seq reads against specific sequences:

If I align the A & B RNA-Seq reads against their respective draft genomes, it seems to me that there may be some gene-specific problems due to genome assembly issues. For example, I know at least one gene of interest is split across two contigs, and there are (known) duplicated genes that may be mis-counted because they show up as 1 tiny contig in the draft.
If I align the A & B RNA-Seq reads against C, that avoids the draft genome issues seen for A/B, but obviously there will be more mismatches, which may affect analysis of differential expression in a non-uniform way, and which won't catch any areas where the differences may be too large to map reads to C at all.
If I align the A & B RNA-Seq reads against pseudo-reference sequences, generated by using the A & B variant calling data to modify the C reference genome, I may also run into other issues related to variant calling artifacts.

What's the least-bad way to try to analyze this dataset?

RNA-Seq rna-seq sequencing genome • 1.1k views

ADD COMMENT • link updated 5.1 years ago by Asaf 10k • written 5.1 years ago by g • 0

0

Entering edit mode

I think the right answer here is to spend the money, by hook or by crook, and just do the genomes for A and B. Hybrid illumina and nanopore sequencing at one of the facilities I know of is only about £120 a genome - surely there's some slack in the budget or slush money floating around to cover that?

ADD REPLY • link 5.1 years ago by Joe 21k

0

Entering edit mode

Unfortunately, as an external user, pricing for Nanopore or PacBio at facilities I know of is closer to $1k/genome (including library prep) and we'd of course be doing two genomes. I'll take another look and see if I can find any facilities with a price closer to the range you mentioned, since that would certainly be an easier sell.

Perhaps I could also try to split the difference by completing A, taking a look at the existing reads for B against the new A assembly, and seeing whether the mutations are minor enough that I might be able to get away with preparing a pseudo-reference for B based on variant calling?

ADD REPLY • link 5.1 years ago by g • 0

0

Entering edit mode

I was a little off in my estimate for dual illumina and nanopore, but the service I was thinking of was https://microbesng.uk/our-services/

You might only need illumina if you don’t bother to close the genome. In any case, you should be able to do both genomes for well under a thousand.

ADD REPLY • link 5.1 years ago by Joe 21k

score 0 · Answer 1 · 2019-04-02

Indeed a nasty situation. I would personally go with the third option, I think it's somewhat of a compromise. Nevertheless, You would refrain from comparing A and B directly as any change in expression would be questionable, the sequence can change reagents affinity for instance in one strain leading to differential reads counts. The questions you can definitely answer here (using either mapping strategy) are interaction questions, e.g. A:time1 vs B:time1 which will take into account differences between the baselines of A and B, that might be biological or technical, and just say what's the difference between the strains in handling the different condition.