Hi,
I am currently working through an RNA-seq pipeline, where I have multiple files per sample, since they were sequenced in multiple lanes (i.e., L007 and L006) to reach a depth of 50 million reads.
I have gone through most of the trimming, assembly (Trinity), clustering (CD-HIT), and annotation (Trinotate) steps, and now I will be using GOseq for the quantification steps. Unfortunately, I did not merge the per-lane files for each sample, but I have read from a few sources that merging after the alignment step can be beneficial (especially for batch effects). However, I am confused about how I should combine the files, and at what point. I have seen some examples of merging BAM files with samtools, but I am not sure where I would get a BAM file from. Pretty soon I will be reusing the P.trim.gz files from the Trinity output for estimating transcript abundance, so maybe there is a way I could use these somehow? Or should I just wait until the later stages, once everything is already quantified and ready to be analyzed?
I am very new to RNA-seq, so any help would be greatly appreciated!
Thanks, Olivia
Samples run on multiple lanes represent technical sequencing replicates and can be merged on a per-sample basis, either before alignment (as sequence files) or after alignment (the alignment step is where the BAM files come from).
Some people prefer to process the lane-specific files separately to speed up alignment (several smaller files aligned in parallel rather than one large file per sample). With either option, make sure that you end up with one aligned data file per sample. Do not merge biological replicates, since that information is needed for expression analysis.
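If you go the "merge before alignment" route, here is a minimal Python sketch of the idea. It relies on the fact that gzip files can be concatenated byte for byte and still decompress as one stream. The file naming pattern (`<sample>_L<lane>_R<read>...fastq.gz`) and the function names `group_by_sample` and `merge_lanes` are my assumptions for illustration, so adjust the pattern to whatever your trimmed files are actually called. For paired-end data the R1 and R2 files must be joined in the same lane order, which the sort on the lane number is meant to guarantee.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed (hypothetical) naming: <sample>_L<lane>_R<read><anything>.fastq.gz
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+)_L(?P<lane>\d{3})_(?P<read>R[12])(?P<rest>.*)\.fastq\.gz$"
)


def group_by_sample(paths):
    """Group lane files by (sample, read), sorted by lane number."""
    groups = defaultdict(list)
    for path in paths:
        path = Path(path)
        match = LANE_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        groups[(match["sample"], match["read"])].append((match["lane"], path))
    # Sorting by lane keeps R1 and R2 in the same order for paired-end data.
    return {key: [p for _, p in sorted(files)] for key, files in groups.items()}


def merge_lanes(paths, out_dir):
    """Concatenate the gzipped lane files of each sample into one file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for (sample, read), files in group_by_sample(paths).items():
        target = out_dir / f"{sample}_{read}.merged.fastq.gz"
        with open(target, "wb") as out:
            for lane_file in files:
                # Raw byte copy: concatenated gzip members remain valid gzip.
                with open(lane_file, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[(sample, read)] = target
    return merged
```

You would call it as something like `merge_lanes(Path("trimmed").glob("*.fastq.gz"), "merged")`, where both directory names are placeholders. If you would rather merge after alignment, the usual equivalent is `samtools merge` on the per-lane BAM files of each sample.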
What does this mean? Is there no reference genome/transcriptome available for the organism you are working with? There is generally no need to do Trinity transcriptome assemblies if the former are available.