Question: Best way to merge RNA-seq data from two sequencing runs of the same samples
3.4 years ago by
United States
unsupervised_learner10 wrote:


I have paired-end RNA-seq reads from a drug-treatment experiment. Many samples have fewer than 15 million mapped reads, which is too few, and the number of mapped reads varies widely across biological replicates. Differential expression and splicing analyses of these samples suggest that greater sequencing depth would improve the statistical power of my tests, and I have RNA remaining from these samples that I could re-sequence.

The questions

Is it analytically and statistically tractable to re-sequence the same samples and control for potential artifacts in the combined data?

What would be the best workflow for merging data from these two RNA-seq runs? I would guess that it's best to keep the runs separate until the counts have been summarized. Then I can carry out PCA to visually inspect the gross extent of artifact in the different runs before merging the counts for statistical analyses.

Beyond gross visual inspection of the PCs, what quality control steps could I take if I identify a strong batch effect between the sequencing runs? Would software like svaseq or ComBat be appropriate in that case? If so, would it be best to remove the batch effect from the samples before combining the count data?


Sequencing itself is technically very reproducible. As long as you stick to the same platform and read length, it should be fine to run the libraries again. You could check the results with PCA before proceeding with the rest of the analysis.

written 3.4 years ago by genomax
3.4 years ago by
WouterDeCoster44k wrote:

As you suggested (and as genomax2 confirmed), it's probably best to use PCA to check that your two runs give approximately the same result.

But once you have determined that the runs agree, I would suggest merging your BAM files and repeating the counting before you do your final analysis. That would minimize your chance of errors.
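As a rough illustration of the merge step, here is a minimal sketch using samtools. The function name `merge_runs`, the `DRY_RUN` switch and all file names are made up for the example, and it assumes the per-run BAM files are coordinate-sorted (which `samtools merge` expects by default and `samtools index` requires).

```shell
#!/bin/sh
# Sketch only: merge the per-run BAMs of one sample, then index the result.
# merge_runs, DRY_RUN and the file names are hypothetical.
# Assumes every input BAM is coordinate-sorted.
merge_runs() {
    merged=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the commands instead of running them
        echo "samtools merge $merged $*"
        echo "samtools index $merged"
    else
        samtools merge "$merged" "$@" && samtools index "$merged"
    fi
}

# Example call (hypothetical file names):
#   merge_runs sampleA.merged.bam sampleA.run1.bam sampleA.run2.bam
```

The merged BAM would then go through whatever counting tool you used before, once per sample.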

Furthermore, if you are using a two-pass alignment (e.g. STAR's 2-pass mode), it might be advantageous to merge the fastq files across runs and repeat the alignment.



I have the same question.

I have 4 lanes per sample (paired-end) in each of 2 sequencing runs. Can I concatenate the fastq files like this?

    cat fastq1_lane1_batch1 fastq1_lane1_batch2 fastq1_lane2_batch1 fastq1_lane2_batch2 \
        fastq1_lane3_batch1 fastq1_lane3_batch2 fastq1_lane4_batch1 fastq1_lane4_batch2 > fastq1

    cat fastq2_lane1_batch1 fastq2_lane1_batch2 fastq2_lane2_batch1 fastq2_lane2_batch2 \
        fastq2_lane3_batch1 fastq2_lane3_batch2 fastq2_lane4_batch1 fastq2_lane4_batch2 > fastq2

I ask because PCA shows no difference between the runs.


modified 12 months ago • written 12 months ago by A

Yes. Note that you can also cat .fastq.gz files together without having to decompress.
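To show what that looks like, here is a small self-contained sketch. It builds two tiny gzipped fastq files with made-up names in a temporary directory, concatenates them without decompressing, and counts the reads in the result.

```shell
#!/bin/sh
# Sketch with made-up file names and toy reads.
tmp=$(mktemp -d)

# Two one-read fastq files standing in for the two runs
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$tmp/S1_run1_R1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$tmp/S1_run2_R1.fastq.gz"

# Concatenated gzip streams are still a valid gzip file
cat "$tmp/S1_run1_R1.fastq.gz" "$tmp/S1_run2_R1.fastq.gz" > "$tmp/S1_R1.fastq.gz"

# fastq has 4 lines per read, so reads = lines / 4
lines=$(gzip -dc "$tmp/S1_R1.fastq.gz" | wc -l | tr -d ' ')
reads=$((lines / 4))
echo "merged reads: $reads"
```

For paired-end data, list the files in the same order for read 1 and read 2 so the mates stay in sync, and check that both merged files end up with the same read count.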

See also How to add images to a Biostars post

written 12 months ago by WouterDeCoster