How to concatenate multiple fastq files (located in different directories) for each sample
3.1 years ago
salehm ▴ 10

Hi,

I received RNA-seq data for 55 samples run on an Illumina NextSeq 500. Each sample has four FASTQ files, and each file is in a separate directory, so I have a total of 220 directories, each containing a single FASTQ file. I now need to concatenate the four files belonging to each sample into one FASTQ file. I used to use this command:

"for i in $(find ./ -type f -name ".fastq.gz" | while read F; do basename $F | rev | cut -c 22- | rev; done | sort | uniq) do echo "Merging R1" cat "$i"_L00_R1_001.fastq.gz > "$i"_ME_L001_R1_001.fastq.gz done"

However, it requires all the files to be in one directory, and my files are spread across 220 directories. So I am wondering if there is a way to modify this command to look for files in different directories, or if there is a command I could use to move each file from its individual directory into a single directory.

Thank you for your help.

3.1 years ago
GenoMax 141k

See this answer for inspiration: C: Concatenating fastq.gz files across lanes
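
A minimal, untested sketch along the lines of that answer, assuming the standard Illumina naming (e.g. Sample1_S1_L001_R1_001.fastq.gz); the _ME_ output suffix just mirrors the question's convention:

    # derive the unique sample prefixes by stripping the lane/read suffix,
    # then concatenate each sample's lane files in path order
    # (gzip files can be cat-ed together directly)
    for sample in $(find . -type f -name "*_R1_001.fastq.gz" -exec basename {} \; | sed 's/_L00[0-9]_R1_001\.fastq\.gz$//' | sort -u); do
        find . -type f -name "${sample}_L00?_R1_001.fastq.gz" | sort | xargs cat > "${sample}_ME_L001_R1_001.fastq.gz"
    done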

3.1 years ago
rpolicastro

GNU parallel solution that is untested but should work (and will probably summon Ole Tange to provide a better version). Its only requirement is that you run it from a parent directory, since it searches recursively through the directories below it.

    parallel --dry-run -j1 cat {} '>>' '$(basename {} | rev | cut -c 22- | rev)'_ME_L001_R1_001.fastq.gz ::: $(find . -type f -name "*.fastq.gz")

Remove --dry-run if the commands look right. -j1 runs the command for one file at a time; you can increase that for parallelization (or remove it to use all available cores).
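
If you would rather just move everything into a single directory first (the second option raised in the question), here is an untested sketch; the ../merged_fastq destination is only an example name:

    # move every fastq.gz under the current tree into a sibling directory;
    # -n refuses to overwrite if two files happen to share a name
    mkdir -p ../merged_fastq
    find . -type f -name "*.fastq.gz" -exec mv -n {} ../merged_fastq/ \;

The original per-directory loop from the question can then be run inside ../merged_fastq.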

Thank you @rpolicastro
