I've got an RNA-Seq project that got more complicated when it became apparent the reference genome was more different from my strains than anticipated. I'm trying to figure out the best (or least bad) sequence to map my reads against, and could use some help, since I am probably overthinking things at this point.
Things I have:
- Two bacterial strains (A and B). B was generated by random chemical mutagenesis of A (number & type of mutations are unknown.) They turn out to be different enough from the type strain C that they may constitute a distinct substrain.
- Timecourse RNA-Seq data for both A & B (single, not paired end). Multiple timepoints per biological sample, several biological replicates.
- Closed, annotated reference genome for the type strain C.
- Incomplete draft genomes for both A & B (Illumina sequencing). However, they're split across too many contigs. There are some low-complexity regions, near-identical gene copies, and other features that complicate assembly, along with some reads from a difficult-to-eradicate contaminant strain. Multiple assembly methods still leave them a bit of a mess.
- Simple variant calling for A & B against C using the reads obtained for the A & B draft genomes.
- It will be hard to convince other people on this project to spend the money to close or significantly improve the A & B drafts via PacBio or something unless it really is necessary. We didn't originally budget for genome sequencing.
Data I want to be able to assemble:
- Differential gene expression within one substrain (A-only or B-only) across timepoints or between A & B at one timepoint (nothing fancy - pairwise DESeq2 comparisons and I may also try some timecourse-focused tools).
- There are genes of particular interest that are directly affected by assembly issues in the current A & B draft genomes. I nevertheless want to make sure that my differential gene expression analyses for them are as decent as possible.
- I want to be reasonably sure about any differences in gene sequences for A & B vs. C, for interpreting paired proteomics data.
Concerns I have about about mapping my A/B RNA-Seq reads against specific sequences:
- If I align the A & B RNA-Seq reads against their respective draft genomes, it seems to me that there may be some gene-specific problems due to genome assembly issues. For example, I know at least one gene of interest is split across two contigs, and there are (known) duplicated genes that may be mis-counted because they show up as 1 tiny contig in the draft.
- If I align the A & B RNA-Seq reads against C, that avoids the draft genome issues seen for A/B, but obviously there will be more mismatches, which may affect analysis of differential expression in a non-uniform way, and which won't catch any areas where the differences may be too large to map reads to C at all.
- If I align the A & B RNA-Seq reads against pseudo-reference sequences, generated by using the A & B variant calling data to modify the C reference genome, I may also run into other issues related to variant calling artifacts.
What's the least-bad way to try to analyze this dataset?