Question

Merge replicate fastq files of the same sample together?

2.9 years ago

foxiw ▴ 10

I have 13 samples that were initially sequenced with an expected yield of around 30 million reads per sample, but we only got about half of that. We therefore re-pooled the samples and ran them again, getting more reads this time. I now have 2 files per sample: run 1 (with fewer reads) and run 2 (with more reads). I tried merging these files (using cat) to get even greater depth, but my QC analysis shows a lot of duplication. Is there a better way of merging these files whilst avoiding high levels of duplications? Thanks!

fastq RNA-seq • 1.3k views

score 2 · Accepted Answer · 2021-06-08

2.9 years ago

GenoMax 141k

RNA-seq data is expected to have some duplication. So if you are going by a FastQC result, such as sequences flagged as "overrepresented", that is likely not a big cause for worry. You should go ahead and analyze the data.

"Is there a better way of merging these files whilst avoiding high levels of duplications?"

If you do have a problem with duplicates (coming from one too many PCR cycles) then nothing is going to fix that at this stage.

Thanks.

After running MultiQC on my FastQC reports, the Sequence Counts plot shows a lot of duplication per sample. This makes me think that MultiQC is treating the sequences from run 2 as 'duplicates' of run 1, which would mean I haven't merged the files properly. Is this normal then?

ADD REPLY • link 2.9 years ago by foxiw ▴ 10

cat'ing the files simply appends the contents of the second file to the end of the first. As far as FastQC/MultiQC is concerned, that is just one set of data. Concatenation in itself has no effect on duplicates per se. Your sample is going to contain that duplication (if present), and at this point you can't fix that part (if the sample was over-amplified, for example). Hopefully all samples in this dataset underwent identical treatment, so there will be no experimental bias.
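If you want to convince yourself of this, here is a rough Python sketch (not anything FastQC or MultiQC does internally). It concatenates two gzipped FASTQ files byte for byte, which should have the same effect as cat, and then reports the read count and the fraction of reads whose sequence has already been seen in the same file. The function names, file names and toy reads are all made up for illustration, and the exact-match fraction is only a crude stand-in for FastQC's duplication estimate.

```python
import gzip
import shutil
import tempfile
from collections import Counter
from pathlib import Path


def merge_fastq(inputs, output):
    """Byte-wise concatenation, like `cat run1.fastq.gz run2.fastq.gz > merged.fastq.gz`."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def read_sequences(path):
    """Yield the sequence line of every 4-line FASTQ record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def duplication_stats(path):
    """Return (total reads, fraction of reads that repeat an earlier sequence)."""
    counts = Counter(read_sequences(path))
    total = sum(counts.values())
    unique = len(counts)
    return total, (1 - unique / total if total else 0.0)


def write_fastq(path, sequences):
    """Write a toy gzipped FASTQ file (hypothetical helper, for the demo only)."""
    with gzip.open(path, "wt") as handle:
        for i, seq in enumerate(sequences):
            handle.write(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n")


if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    run1, run2, merged = (tmp / n for n in ("run1.fastq.gz", "run2.fastq.gz", "merged.fastq.gz"))
    # Toy reads standing in for two sequencing runs of the same library.
    write_fastq(run1, ["ACGT", "ACGT", "TTGA"])
    write_fastq(run2, ["ACGT", "GGCC", "TTGA", "CCAA"])
    merge_fastq([run1, run2], merged)
    for label, path in (("run 1", run1), ("run 2", run2), ("merged", merged)):
        total, dup = duplication_stats(path)
        print(f"{label}: {total} reads, {dup:.1%} exact duplicates")
```

The merged read count is just the sum of the two runs, so nothing is added or changed by the merge. In the toy data the merged duplicate fraction is higher than in either run alone, but only because the same sequences occur in both runs, not because of how the files were joined.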