Merge fastq files
17 months ago
BM ▴ 70

I have RNA-seq FASTQ files; each sample has multiple files from different lanes:

A10_S4_R1.fastq.gz
A10_S8_R1.fastq.gz
A10_S40_R1.fastq.gz
A10_S4_R2.fastq.gz
A10_S8_R2.fastq.gz
A10_S40_R2.fastq.gz

and so on for A11 up to A40.

I am trying to merge using:

cat A11*_R1.fastq.gz > A11_R1.fastq.gz

This works for a single sample, but I need a command that loops through the folder and merges all R1 files for one sample, then the next sample, and likewise for the R2 files.
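Plain cat is enough for this kind of merge because the gzip format allows several compressed members back to back: decompressing the concatenation gives the concatenation of the decompressed inputs. A minimal sketch of a sanity check, assuming gzip is installed and the FASTQ records are the usual four lines each; the function name count_reads is made up for illustration:

```bash
# count_reads (hypothetical helper): print the number of FASTQ records
# in one or more gzipped files, assuming 4 lines per record.
count_reads() {
    lines=$(gzip -dc -- "$@" | wc -l)
    echo $(( lines / 4 ))
}
```

After a merge, `count_reads A11_S*_R1.fastq.gz` and `count_reads A11_R1.fastq.gz` should print the same number.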

I have tried the answers from previous posts, but none of them work:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

for name in *.fastq.gz; do
    printf '%s\n' "${name%_*_*_R[12]*}"
done | uniq |

for f in *.fastq.gz; do
    [[ "$f" =~ ^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ ]]
    cat "$f" >> "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
done

Can anyone advise? Thanks in advance.

fastq bash Unix concatenate merge • 1.2k views

"Each sample has multiple files from different lanes"

The examples you posted don't seem to show that. Files for a sample run on different lanes have an L00* field in the file name. An Illumina flowcell (FC) has at most 8 lanes, so there is no chance of having 40 lanes (unless the sample ran across multiple FCs, but even then the L00* numbers would repeat across FCs).

The S* numbers you have are just the row numbers of the samples in the sample sheet used for demultiplexing. They don't have any useful meaning.

Disclaimer: this holds unless your sequencing facility is doing something non-standard.
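To make the naming convention concrete, here is a small sketch that splits a file name into its fields. It assumes names of the form <sample>_S<n>_L<lane>_R<read>_001.fastq.gz, which matches a name such as A14_S90_L008_R1_001.fastq.gz; the function name parse_illumina_name and the output labels are made up for illustration, and a facility with a different naming scheme would need a different pattern.

```bash
# parse_illumina_name (hypothetical helper): print the fields of a name like
# <sample>_S<n>_L<lane>_R<read>_001.fastq.gz; print nothing if the name does
# not follow that assumed pattern (for example, when the lane field is missing).
parse_illumina_name() {
    printf '%s\n' "${1##*/}" |
        sed -n -E 's/^(.+)_S([0-9]+)_L([0-9]{3})_(R[12])_[0-9]{3}\.fastq\.gz$/sample=\1 sheet_row=\2 lane=\3 read=\4/p'
}
```

For example, `parse_illumina_name A14_S90_L008_R1_001.fastq.gz` prints `sample=A14 sheet_row=90 lane=008 read=R1`, while a name without an L00* field, such as A10_S4_R1.fastq.gz, prints nothing.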


Sorry for the confusion. The original files were, as you said, from different lanes, e.g. A14_S90_L008_R1_001.fastq.gz. The files were merged using:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

This resulted in the files A10_S4_R1.fastq.gz, A10_S8_R1.fastq.gz, A10_S40_R1.fastq.gz, A10_S4_R2.fastq.gz, A10_S8_R2.fastq.gz, A10_S40_R2.fastq.gz, etc.

That's where I get stuck: I can't merge these files.


I see. So at this point you just need to focus on A* since those S* are not useful.


So would it be the following?

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

Any help in how I would do this please?
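One way to do this, offered as a minimal sketch rather than a tested pipeline: treat everything before the first underscore as the sample ID (ignoring the S* part, as suggested earlier in the thread) and concatenate all R1 files, then all R2 files, for each ID. The function name merge_by_sample and the separate output directory are assumptions for illustration; writing to a different directory keeps merged files from being picked up as inputs on a re-run.

```bash
# merge_by_sample INDIR OUTDIR (hypothetical helper)
# Assumes names like A10_S4_R1.fastq.gz: sample ID before the first "_",
# and "_R1" or "_R2" marking the read.
merge_by_sample() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Collect the unique sample IDs from the R1 file names.
    for path in "$indir"/*_R1*.fastq.gz; do
        [ -e "$path" ] || continue          # glob matched nothing
        name=${path##*/}
        printf '%s\n' "${name%%_*}"
    done | sort -u |
    while IFS= read -r sample; do
        for r in R1 R2; do
            # The "_" right after the ID stops A1 from also matching A10.
            # Globs expand in sorted order, so R1 and R2 parts are joined
            # in the same order and the pairs stay in sync.
            set -- "$indir/$sample"_*_"$r"*.fastq.gz
            [ -e "$1" ] || continue         # e.g. no R2 for single-end data
            cat "$@" > "$outdir/${sample}_${r}.fastq.gz"
        done
    done
}
```

Run from the folder holding the files, `merge_by_sample . merged` would write merged/A10_R1.fastq.gz, merged/A10_R2.fastq.gz and so on, one pair per sample.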


If you can, ask whoever generates the FASTQ files to use the --no-lane-splitting option.

17 months ago
Asaf 8.6k

In Nextflow you can do:

def getLibraryId( prefix ){
  // Return the ID number, you can change for other file formats, here it just takes the first part before "_"
  prefix.split("_")[0]
}

// Gather the pairs of R1/R2 according to sample ID
Channel
     .fromFilePairs(params.datadir + '/**R{1,2}*.fastq.gz', flat: true)
     .map { prefix, file1, file2 -> tuple(getLibraryId(prefix), file1, file2) }
     .groupTuple().set{ files_channel }

Now files_channel will contain tuples of the prefix and two lists: all the R1 files and all the R2 files. See more here: https://nextflow-io.github.io/patterns/index.html#_process_outputs_into_groups (that example covers only single-end data, though).
