Question

Merge fastq files

5.0 years ago

BM ▴ 70

I have RNA-seq FASTQ files, and each sample has multiple files from different lanes:

    A10_S4_R1.fastq.gz
    A10_S8_R1.fastq.gz
    A10_S40_R1.fastq.gz
    A10_S4_R2.fastq.gz
    A10_S8_R2.fastq.gz
    A10_S40_R2.fastq.gz

and so on for A11 up to A40.

I am trying to merge them using:

    cat A11*_R1.fastq.gz > A11_R1.fastq.gz

This works for a single sample, but I need a command that loops through the folder and merges all R1 files for one sample, then the next sample, and does the same for the R2 files.

I have tried answers from previous posts, but none of them work:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

for name in *.fastq.gz; do
printf '%s\n' "${name%_*_*_R[12]*}"
done | uniq |

for f in *.fastq.gz; do 
[[ "$f" =~ ^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ ]]
cat "$f" >> "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
done

Can anyone advise? Thanks in advance.

fastq bash Unix concatenate merge • 3.7k views

ADD COMMENT • link updated 5.0 years ago by Asaf 10k • written 5.0 years ago by BM ▴ 70

"Each sample has multiple files from different lanes"

The examples you have posted don't seem to indicate that. Files for samples run in different lanes have an L00* field in the file name. An Illumina flowcell (FC) has at most 8 lanes, so there is no chance of having 40 lanes (unless the sample ran across multiple FCs, but even then the L00* numbers would repeat across FCs).

The S* numbers you have are just the row number of that particular sample in the sample sheet used for demultiplexing. They don't carry any useful meaning.

Disclaimer: Unless your sequencing facility is doing something non-standard.
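
To make the naming concrete, here is a small sketch that splits a file name of the usual SampleName_S#_L00#_R#_001.fastq.gz form into its fields. The function name describe_fastq and the example file name are only illustrative, and the sketch assumes the sample name itself contains no underscore.

```shell
# Hypothetical helper: split an Illumina-style FASTQ file name into its fields.
# Assumes <sample>_S<n>_L<lane>_R<read>_001.fastq.gz with no "_" inside <sample>.
describe_fastq() {
    name=${1##*/}            # drop any leading directory
    name=${name%.fastq.gz}   # drop the extension
    old_ifs=$IFS
    IFS=_                    # split on underscores only
    set -- $name
    IFS=$old_ifs
    printf 'sample=%s sheet_row=%s lane=%s read=%s\n' "$1" "$2" "$3" "$4"
}

describe_fastq A14_S90_L008_R1_001.fastq.gz
# prints: sample=A14 sheet_row=S90 lane=L008 read=R1
```

Only the first field identifies the sample; the lane field is what tells you whether a sample really was split across lanes.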

ADD REPLY • link 5.0 years ago by GenoMax 152k

Sorry for the confusion. The original files were, as you said, from different lanes, e.g. A14_S90_L008_R1_001.fastq.gz. The files were merged using:

    printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
    while read prefix; do
        cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
        cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
    done

This resulted in files such as A10_S4_R1.fastq.gz, A10_S8_R1.fastq.gz, A10_S40_R1.fastq.gz, A10_S4_R2.fastq.gz, A10_S8_R2.fastq.gz, A10_S40_R2.fastq.gz, etc.

That's where I get stuck: I can't merge these files.

ADD REPLY • link 5.0 years ago by BM ▴ 70

I see. So at this point you just need to focus on A* since those S* are not useful.
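
One way to group on the A* part alone is sketched below. It is a minimal sketch, not a tested pipeline: it assumes every file is named <sample>_S<n>_R1.fastq.gz or <sample>_S<n>_R2.fastq.gz, that the sample ID contains no underscore, and that the merged files go to a separate output directory. The function name merge_by_sample is made up for illustration.

```shell
# Sketch: merge FASTQ files per sample, ignoring the S* field.
# Assumes names of the form <sample>_S<n>_R1.fastq.gz / <sample>_S<n>_R2.fastq.gz
# with no underscore inside <sample>. merge_by_sample is a hypothetical name.
merge_by_sample() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    # Unique sample IDs: the text before the first underscore of each file name.
    (cd "$indir" && printf '%s\n' *_R[12].fastq.gz) | cut -d_ -f1 | sort -u |
    while read -r sample; do
        for mate in R1 R2; do
            # The underscore right after the ID keeps A1 from also matching A10.
            cat "$indir/$sample"_*_"$mate".fastq.gz > "$outdir/${sample}_${mate}.fastq.gz"
        done
    done
}
```

Called as `merge_by_sample . merged`, this would write merged/A10_R1.fastq.gz, merged/A10_R2.fastq.gz and so on. Concatenated gzip files are themselves valid gzip, so nothing needs decompressing. Writing to a separate directory keeps the merged files out of the input glob on a re-run, and because the R1 and R2 globs expand in the same sorted order, the mates should stay in step as long as every R1 file has a matching R2 file.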

ADD REPLY • link 5.0 years ago by GenoMax 152k

So would it be:

    printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_*\).*/\1/' | uniq |
    while read prefix; do
        cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
        cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
    done

ADD REPLY • link 5.0 years ago by BM ▴ 70

Any help with how I would do this, please?

ADD REPLY • link 5.0 years ago by BM ▴ 70

If you can, ask the people generating the FASTQ files to use the --no-lane-splitting option when they make them.

ADD REPLY • link 5.0 years ago by swbarnes2 15k

score 1 · Answer 1 · 2020-06-29

In Nextflow you can do:

def getLibraryId( prefix ){
  // Return the sample ID: the part of the prefix before the first "_".
  // Adapt this for other file-naming schemes.
  prefix.split("_")[0]
}

// Gather the R1/R2 pairs and group them by sample ID
Channel
    .fromFilePairs(params.datadir + '/**R{1,2}*.fastq.gz', flat: true)
    .map { prefix, file1, file2 -> tuple(getLibraryId(prefix), file1, file2) }
    .groupTuple()
    .set { files_channel }

Now files_channel will emit tuples of the sample ID and two lists: all the R1 files and all the R2 files. See more here: https://nextflow-io.github.io/patterns/index.html#_process_outputs_into_groups (that example covers single-end reads only, though).