Hi,
I have been given a bunch of files from an RNA-seq run that look like this:
3062_GTGGCC_L003_R2_008.fastq.gz 3062_GTGGCC_L003_R2_007.fastq.gz 3062_GTGGCC_L003_R2_006.fastq.gz .... 3062_GTGGCC_L003_R2_001.fastq.gz
3062_GTGGCC_L003_R1_008.fastq.gz 3062_GTGGCC_L003_R1_007.fastq.gz
I haven't been given much info about them other than that they are all from the same sample. I presume R1 is the forward reads and R2 the reverse reads of each pair, and that the forward and reverse reads each had to be split into 8 files because of file size issues or something.
I know there have been threads before about merging FASTQ files with a simple shell script. Is it really as easy as concatenating them? I am a bit suspicious because it looks too simple, and I am wary of introducing errors with paired reads further down the line. Can anyone give me a bit of advice on the best way to merge them into a single forward and a single reverse file for downstream analysis?
Thanks.
They are split because bcl2fastq (Illumina's demultiplexing tool) has a parameter that limits the number of reads per file. It defaults to 4,000,000 reads per file. I think that's for historical reasons. I have never heard of a problem with FASTQs being too big, but people do complain about having too many FASTQs.
And yes, as Pierre already said, you can simply cat the gzipped files. I was surprised to hear about this too. Just make sure you keep R1 and R2 separate.
It's better to have multiple FASTQ files per sample: for WGS/WES you can then map them in parallel and merge the results at the end.
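For anyone wondering what the concatenation looks like in practice, here is a minimal sketch. It relies on the fact that several gzip streams joined end to end still form a valid gzip file. The helper names merge_fastq and count_reads and the output file names are made up for illustration, and it assumes the chunk suffixes are zero-padded to three digits as in the question, so the shell glob expands them in the same order for R1 and R2.

```shell
#!/usr/bin/env bash
# Sketch only: merge_fastq and count_reads are hypothetical helper names.

# Concatenate all chunks of one read direction into a single gzipped FASTQ.
merge_fastq() {
    local prefix="$1"   # e.g. 3062_GTGGCC_L003_R1
    local out="$2"      # output name; must not match the glob below
    # The glob expands in sorted order (_001, _002, ...), so R1 and R2 chunks
    # are joined in the same order and the read pairs stay in sync.
    cat "${prefix}"_[0-9][0-9][0-9].fastq.gz > "$out"
}

# Count reads in a gzipped FASTQ, assuming plain 4-line records.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example usage with the file names from the question (output names assumed):
# merge_fastq 3062_GTGGCC_L003_R1 3062_R1.merged.fastq.gz
# merge_fastq 3062_GTGGCC_L003_R2 3062_R2.merged.fastq.gz
# count_reads 3062_R1.merged.fastq.gz
# count_reads 3062_R2.merged.fastq.gz
```

If the two counts differ, a chunk is probably missing or truncated on one side, and it is worth sorting that out before mapping.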
Sure. But there is nothing magical about 4,000,000 as far as I know.
Also, if you want to split your mapping, you can always split the full FASTQs.
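A rough sketch of how splitting a full FASTQ could be done, in case it helps. The helper name split_fastq and the chunk prefixes are made up, and it assumes plain 4-line FASTQ records (no wrapped sequence lines). Use the same chunk size for R1 and R2 so that matching chunks contain matching pairs.

```shell
#!/usr/bin/env bash
# Sketch only: split_fastq is a hypothetical helper name.

# Split a gzipped FASTQ into gzipped chunks of a fixed number of reads.
split_fastq() {
    local in="$1"               # merged input, e.g. 3062_R1.merged.fastq.gz
    local reads_per_chunk="$2"  # use the same value for R1 and R2
    local out_prefix="$3"       # chunks are named <prefix>aa, <prefix>ab, ...
    # One record is 4 lines, so cut on a multiple of 4 lines.
    gzip -dc "$in" | split -l $(( reads_per_chunk * 4 )) - "$out_prefix"
    # Compress the chunks (split's default suffix is two letters).
    gzip "${out_prefix}"??
}

# Example usage (file names assumed):
# split_fastq 3062_R1.merged.fastq.gz 4000000 3062_R1_chunk_
# split_fastq 3062_R2.merged.fastq.gz 4000000 3062_R2_chunk_
```

Chunks with the same suffix (aa, ab, ...) from R1 and R2 can then be mapped together as a pair.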