Question

When to merge RNA-seq files with multiple sequencing lanes


9 weeks ago

Olivia • 0

Hi,

I am currently working through an RNA-seq pipeline, where I have multiple files per sample, since they were sequenced in multiple lanes (i.e., L007 and L006) to reach a depth of 50 million reads.

I have gone through most of the trimming/assembly (Trinity), clustering (CD-HIT), and annotation (Trinotate) steps, and now I will be using GOseq for quantification steps. Unfortunately, I did not merge the files for the different lanes of each sample, but I have read in a few sources that merging after the alignment step can be beneficial (especially for batch effects). But I am confused as to how I should combine the files together and at what point? I have seen some examples of merging BAM files with samtools, but I am not sure where I would get a BAM file from. Pretty soon, I will be reusing P.trim.gz files from the Trinity output again for estimating transcript abundance, so maybe there is a way I could use these somehow? Or should I just wait until the later stages, once everything is already quantified and ready to analyze?

I am very new to RNA-seq, so any help would be greatly appreciated!

Thanks, Olivia

RNA-seq DESeq2 Trinity CD-HIT GOseq • 4.1k views



how I should combine the files together and at what point?

Samples run across multiple lanes are technical sequencing replicates and can be merged on a per-sample basis either before alignment (as FASTQ sequence files) or after alignment (the alignment step is what produces the BAM files).

Some people prefer to align the lane-specific files separately to speed up alignment, since several smaller files can be aligned in parallel faster than one large file per sample. Whichever option you choose, make sure you end up with one aligned data file per sample. Do not merge biological replicates, since the replicate information is needed for the expression analysis.
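As a rough illustration of the per-sample grouping described above, the following Python sketch groups lane-level FASTQ files by sample and read direction. It assumes Illumina-style file names of the form `<sample>_S<n>_L<lane>_R<1|2>_001.fastq.gz`; the example file names and the `plan_lane_merges` helper are hypothetical, so adjust the pattern to match your own files.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def plan_lane_merges(filenames):
    """Group lane-level FASTQ files by (sample, read direction).

    Returns a dict mapping (sample, read) to a lane-sorted list of files.
    Biological replicates have different sample names, so they stay separate.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = LANE_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unrecognised FASTQ name: {name}")
        key = (match["sample"], f"R{match['read']}")
        groups[key].append((match["lane"], name))
    # Sort by lane so R1 and R2 are later concatenated in the same lane order.
    return {key: [name for _, name in sorted(files)] for key, files in groups.items()}


# Hypothetical file names: two biological replicates, two lanes each, paired-end.
files = [
    "liver_rep1_S1_L007_R1_001.fastq.gz",
    "liver_rep1_S1_L006_R1_001.fastq.gz",
    "liver_rep1_S1_L007_R2_001.fastq.gz",
    "liver_rep1_S1_L006_R2_001.fastq.gz",
    "liver_rep2_S2_L006_R1_001.fastq.gz",
    "liver_rep2_S2_L007_R1_001.fastq.gz",
    "liver_rep2_S2_L006_R2_001.fastq.gz",
    "liver_rep2_S2_L007_R2_001.fastq.gz",
]
plan = plan_lane_merges(files)
for (sample, read), lane_files in sorted(plan.items()):
    print(sample, read, lane_files)
```

Each entry in `plan` lists the lane files that belong in one merged file, so the two lanes of a sample are combined while `liver_rep1` and `liver_rep2` remain separate inputs for the expression analysis.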

I will be reusing P.trim.gz files from the Trinity output again for estimating transcript abundance, so maybe there is a way I could use these somehow?

What does this mean? Is there no reference genome or transcriptome available for the organism you are working with? There is generally no need to run a Trinity transcriptome assembly if one is available.

9 weeks ago by GenoMax 154k

score 2 · Answer 1 · 2025-09-03


9 weeks ago

barslmn ★ 2.5k

If they're from the same sample and the same sequencing run, you should merge the FASTQ files before anything else.
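A minimal sketch of merging FASTQ files up front, assuming the lane files are gzip-compressed. Concatenating gzip files byte for byte gives a valid multi-member gzip stream, which is the same effect as `cat lane1.fastq.gz lane2.fastq.gz > merged.fastq.gz` on the command line. The function names, file names, and tiny demo reads below are made up for illustration. For paired-end data, pass the lanes in the same order for R1 and R2 so that read pairs stay in sync.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_fastq_gz(lane_files, merged_path):
    """Append each gzipped lane file, in the given order, to one merged file."""
    with open(merged_path, "wb") as merged:
        for lane_file in lane_files:
            with open(lane_file, "rb") as lane:
                shutil.copyfileobj(lane, merged)


def count_reads(fastq_gz):
    """Count FASTQ records (four lines each) in a gzipped file."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Demo with two tiny, made-up lane files for one sample.
workdir = Path(tempfile.mkdtemp())
lanes = []
for lane, seq in [("L006", "ACGT"), ("L007", "TTGA")]:
    path = workdir / f"sampleA_{lane}_R1.fastq.gz"
    with gzip.open(path, "wt") as handle:
        handle.write(f"@read_{lane}\n{seq}\n+\nIIII\n")
    lanes.append(path)

merged = workdir / "sampleA_R1.merged.fastq.gz"
concatenate_fastq_gz(lanes, merged)

# Sanity check: the merged file should hold every read from every lane.
merged_reads = count_reads(merged)
lane_reads = sum(count_reads(path) for path in lanes)
print(f"lanes: {lane_reads} reads, merged: {merged_reads} reads")
```

Comparing read counts before and after the merge is a cheap way to catch a missing or truncated lane file before starting the assembly or quantification steps.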
