Hi all -- I'm new to RNA-seq and have had some issues assembling the reads. I'm looking for any advice or input on what might be the best way to handle my data.
My work is done in oocytes of a non-model organism without a reference genome.
I have performed an RNA-IP against two different proteins and processed the IPs for paired-end RNA-seq. My goal is to identify the transcripts associated with both of these proteins (the overlap). In addition, I have also processed whole oocytes for RNA-seq. Everything was done with three biological replicates.
Since my work is done in a non-model organism, I have been using Trinity to assemble my paired-end RNA-seq data. I think there are a few ways this can be done, however, and I could use any input on what the best method might be. I've dabbled with one and had some errors, which is why I'm confused and wondering if a different approach is better.
1. OPTION 1:
Assemble the whole oocyte transcriptome using Trinity and use this as the reference transcriptome.
- After assembly, I used Trinotate to cross-reference my assembly against a recently released protein database for my organism. I believe this assigned protein annotations to the contigs.
- I then used the built-in Trinity plugins to align and estimate transcript abundance using RSEM for each IP sample separately.
- I simply used the raw fastq files (left and right) for each IP (did they need to be assembled here??).
- Looking at the RSEM.isoforms.results output, I saw that in every IP a control transcript had an FPKM of 0, which I'm assuming means it is not expressed. This is obviously concerning, especially since I could identify my control transcript by qPCR using the same sample.
2. OPTION 2:
Assemble all IP reads together. In this case I would then map each IP's raw fastq file back to this "IP-transcriptome" to try and estimate transcript abundance using RSEM. (I'd ignore the whole oocyte data in this scenario)
3. OPTION 3:
Individually and separately assemble each IP. I would then use TransDecoder, Trinotate, and BLAST to try to map these reads to the recently released protein database. I would use the protein database.fasta file as the reference in this case.
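To make the zero-FPKM observation from Option 1 easier to check, here is a minimal Python sketch of how the control transcript's row could be pulled out of each sample's RSEM.isoforms.results file. It assumes the usual tab-separated RSEM isoform table with a header row containing `transcript_id`, `expected_count`, `TPM`, and `FPKM`. The function names are hypothetical, and the contig ID would have to be whichever Trinity contig actually corresponds to the control transcript.

```python
import csv


def lookup_transcript(results_path, transcript_id):
    """Return expected_count, TPM and FPKM for one transcript.

    Returns None when the ID is not in the table at all, which is a
    different situation from the ID being present with zero counts.
    """
    with open(results_path, newline="") as handle:
        # Assumption: tab-separated file with a header row, as RSEM writes it.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if row["transcript_id"] == transcript_id:
                return {
                    "expected_count": float(row["expected_count"]),
                    "TPM": float(row["TPM"]),
                    "FPKM": float(row["FPKM"]),
                }
    return None


def summarize(sample_to_path, transcript_id):
    """Look the same transcript up in every sample's isoform table."""
    return {
        sample: lookup_transcript(path, transcript_id)
        for sample, path in sample_to_path.items()
    }
```

Calling `summarize` with a dictionary of sample names mapped to RSEM.isoforms.results paths (placeholders here, not my real file names) gives one entry per IP. A `None` would suggest the ID I searched for is not in the assembly under that name, while an `expected_count` of 0 would mean the contig is there but no reads were assigned to it.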
Which option seems best? Any idea why my first approach failed to show the control transcript?
This is my first time doing RNA-seq so I apologize if these questions are very naive! All advice is greatly appreciated. Thank you!