I am working on RNA-seq data from E. coli K-12 substr. MG1655 carrying an added expression vector. I am interested in the expression of the genes on this vector. However, when I align the reads to the reference genome (which does not include the vector), the reads from the vector genes I am interested in go unmapped.
To get a better reference, I de novo assembled the transcriptome of my control sample with rnaSPAdes so that the vector would be included in the assembly. The assembly stats were quite poor: N50 = 2315 bp, and the de novo transcriptome was ~500,000 bp larger than the original E. coli MG1655 reference. I then used Scaffold_Builder to try to improve the transcriptome using the MG1655 reference genome, which gave a better N50 of 780113 bp, but the result is now 1.4 million bp larger than the reference.
I considered trying Trinity's genome-guided option, but unmapped reads do not get included, defeating the purpose of what I am trying to do.
Can someone please provide some suggestions on how to further improve and refine my new reference transcriptome? I want to be sure that the reference is of good enough quality for my downstream expression analysis. How can I be sure that it is? Of course, I am hoping to do this without further sequencing if possible.
Thanks in advance!
Never mind, I found this post, which also mentions what you suggested: "A: Quantification of a gene that is not in the reference genome"
Wow, I was making this so much harder than it needed to be. So if the vector contains three genes I am interested in, I could `cat` only those three into the reference FASTA?
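For what it's worth, here is a minimal sketch of that concatenation step in Python, roughly equivalent to `cat reference.fa vector_genes.fa > combined.fa`. The file names and the `append_vector_genes` helper are hypothetical, not from any tool mentioned above. The only thing it adds over plain `cat` is a check that no sequence ID appears in both files, plus a guard against a missing trailing newline in the reference.

```python
from pathlib import Path


def fasta_ids(path):
    """Return the sequence IDs (first word after '>') in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                ids.append(parts[0] if parts else "")
    return ids


def append_vector_genes(reference_fa, vector_fa, combined_fa):
    """Write the reference records followed by the vector records.

    Hypothetical helper: raises ValueError if an ID occurs in both inputs,
    since duplicate sequence names tend to confuse aligners and counters.
    """
    ref_ids = fasta_ids(reference_fa)
    vec_ids = fasta_ids(vector_fa)
    clashes = set(ref_ids) & set(vec_ids)
    if clashes:
        raise ValueError(f"IDs present in both files: {sorted(clashes)}")

    with open(combined_fa, "w") as out:
        for path in (reference_fa, vector_fa):
            text = Path(path).read_text()
            # Make sure each file ends with a newline so records do not fuse.
            out.write(text if text.endswith("\n") else text + "\n")
    return ref_ids + vec_ids


if __name__ == "__main__":
    # Assumed file names; replace with your own paths.
    ids = append_vector_genes("MG1655.fa", "vector_genes.fa", "MG1655_plus_vector.fa")
    print(f"Combined reference contains {len(ids)} sequences")
```

After building the combined FASTA, the aligner index would need to be rebuilt from it. If counting is done against an annotation file, the vector genes would presumably also need matching entries there.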