Hi all! I have some RNA-seq (single-read) datasets split across two different SRA accessions, one with ~30 million reads and the other with ~15 million reads. I have read that I could merge them at the FASTQ, SAM, or BAM stage, and I would like to know whether the choice makes any difference to the quality of the final dataset. Thanks!!
I recommend quality-trimming and aligning them independently, with the aligner piped directly into SAMtools sort (that avoids writing unnecessary intermediate SAM files). Then check the alignment rate for every file and keep only those you feel comfortable with. I have seen technical replicates (the same library sequenced over multiple lanes over several years as part of a large published study) with strikingly different quality: the first replicate had about a 95% alignment rate, and the last one about 40%, with a lot of trash reads (maybe the sample degraded over time in the freezer, I don't know). In any case, do not merge too early, as you may lose the ability to discard bad samples if necessary. Do not assume that published data are always good quality; there are a lot of junk datasets out there in the SRA.
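To make the "check the alignment rate per file, then decide" step concrete, here is a minimal Python sketch. It assumes you have already run `samtools flagstat` on each sorted BAM and captured the text output. The run names, read counts, and the 0.70 cutoff are made up for illustration, and the function names are hypothetical; pick whatever threshold you are comfortable with. Note that flagstat counts alignment records, so secondary or supplementary alignments can shift the percentage slightly.

```python
import re


def parse_flagstat(text):
    """Return (total, mapped) QC-passed record counts from `samtools flagstat` text."""
    total = re.search(r"^(\d+) \+ \d+ in total", text, re.MULTILINE)
    # Anchored at line start, so the "primary mapped" line of newer
    # samtools versions is not picked up by mistake.
    mapped = re.search(r"^(\d+) \+ \d+ mapped \(", text, re.MULTILINE)
    if total is None or mapped is None:
        raise ValueError("unrecognized flagstat output")
    return int(total.group(1)), int(mapped.group(1))


def mapped_fraction(text):
    """Fraction of records that are mapped (0.0 if the file is empty)."""
    total, mapped = parse_flagstat(text)
    return mapped / total if total else 0.0


def runs_to_keep(flagstat_by_run, min_fraction):
    """Names of runs whose mapped fraction is at least min_fraction, sorted."""
    return sorted(
        run
        for run, text in flagstat_by_run.items()
        if mapped_fraction(text) >= min_fraction
    )


# Made-up flagstat excerpts (illustrative only, not real data).
EXAMPLE = {
    "run_A": (
        "30000000 + 0 in total (QC-passed reads + QC-failed reads)\n"
        "28500000 + 0 mapped (95.00% : N/A)\n"
    ),
    "run_B": (
        "15000000 + 0 in total (QC-passed reads + QC-failed reads)\n"
        "6000000 + 0 mapped (40.00% : N/A)\n"
    ),
}

if __name__ == "__main__":
    for run, text in EXAMPLE.items():
        print(run, f"{mapped_fraction(text):.2%}")
    # Arbitrary cutoff for illustration; only runs passing it would be merged.
    print(runs_to_keep(EXAMPLE, 0.70))
```

Only the BAM files for the runs that pass would then go into `samtools merge`, so a bad run can still be dropped before it contaminates the combined dataset.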