Question: Merge fastq files
2 days ago by
BM60
United Kingdom
BM60 wrote:

I have RNA-seq fastq files. Each sample has multiple files from different lanes: A10_S4_R1.fastq.gz, A10_S8_R1.fastq.gz, A10_S40_R1.fastq.gz, A10_S4_R2.fastq.gz, A10_S8_R2.fastq.gz, A10_S40_R2.fastq.gz

and so on for A11 up to A40.

I am trying to merge using: cat A11*_R1.fastq.gz > A11_R1.fastq.gz. This works, but I need a command that loops through the folder and merges all R1 files for one sample, then the next sample, and does the same for the R2 files.

I have tried answers from previous posts, but none of them work:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

for name in *.fastq.gz; do
printf '%s\n' "${name%_*_*_R[12]*}"
done | uniq |

for f in *.fastq.gz; do 
[[ "$f" =~ ^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ ]]
cat "$f" >> "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
done

Can anyone advise? Thanks in advance.

modified 2 days ago by Asaf8.0k • written 2 days ago by BM60

Each sample has multiple files from different lanes

The examples you have posted don't seem to indicate that. Files for samples run in different lanes will have L00* in the file name. There can be at most 8 lanes on an Illumina flowcell (FC), so there is no chance of having 40 lanes (unless the sample ran across multiple FCs, but even then the L00* numbers would repeat across FCs).

The S* numbers you have are just the row number of that particular sample in the sample sheet used for demultiplexing. They don't have any useful meaning.

Disclaimer: Unless your sequencing facility is doing something non-standard.

modified 2 days ago • written 2 days ago by genomax85k
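
To make the naming point above concrete, here is a small sketch that splits a file name of the form Sample_S<n>_L<lane>_R<read>_001.fastq.gz into its parts. The helper name parse_illumina_name and the output field labels are made up for illustration, and it assumes the facility has not renamed the files.

```shell
#!/usr/bin/env bash
# Sketch only: parse_illumina_name is a hypothetical helper, not part of any tool.
# Assumes names like A14_S90_L008_R1_001.fastq.gz (sample, S number, lane, read).
parse_illumina_name() {
    local name out
    name=${1##*/}   # drop any leading directory
    out=$(printf '%s\n' "$name" |
        sed -n 's/^\(.*\)_S\([0-9][0-9]*\)_L\([0-9][0-9][0-9]\)_\(R[12]\)_[0-9][0-9][0-9]\.fastq\.gz$/sample=\1 s_number=\2 lane=\3 read=\4/p')
    # Empty output means the name has no lane field in the expected place.
    [ -n "$out" ] && printf '%s\n' "$out"
}
```

For example, parse_illumina_name A14_S90_L008_R1_001.fastq.gz prints "sample=A14 s_number=90 lane=008 read=R1", while a name without an L00* part, such as A10_S4_R1.fastq.gz, prints nothing and returns a non-zero status.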

Sorry for the confusion. The original files were, as you said, from different lanes, e.g. A14_S90_L008_R1_001.fastq.gz. The files were merged using:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

This resulted in the files A10_S4_R1.fastq.gz, A10_S8_R1.fastq.gz, A10_S40_R1.fastq.gz, A10_S4_R2.fastq.gz, A10_S8_R2.fastq.gz, A10_S40_R2.fastq.gz, etc.

That's where I get stuck: I can't merge these files.

written 2 days ago by BM60

I see. So at this point you just need to focus on A*, since the S* numbers are not useful.

modified 2 days ago • written 2 days ago by genomax85k

So would it be:

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

written 2 days ago by BM60

Any help with how I would do this, please?

written 2 days ago by BM60
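
Following the suggestion above to group on the A* sample ID only, one possible sketch is below. It is not taken from any of the earlier posts. The function name merge_by_sample and the separate output directory are assumptions made for illustration, and it expects names like A10_S4_R1.fastq.gz where the sample ID is everything before the first underscore.

```shell
#!/usr/bin/env bash
# Sketch only: merge_by_sample is a hypothetical helper name.
# Assumes at least one file matching *_R1*/*_R2*.fastq.gz exists in the input directory.
merge_by_sample() {
    local indir=$1 outdir=$2 sample mate
    mkdir -p "$outdir"
    # Unique sample IDs: strip the directory, then everything from the first "_" on.
    printf '%s\n' "$indir"/*_R[12]*.fastq.gz |
        sed 's|.*/||; s/_.*//' |
        sort -u |
        while read -r sample; do
            for mate in R1 R2; do
                # The "_" right after the sample ID keeps A1 from also matching A10.
                # Files are concatenated in glob order, which is the same for R1 and R2.
                cat "$indir/$sample"_*_"$mate"*.fastq.gz \
                    > "$outdir/${sample}_${mate}.fastq.gz"
            done
        done
}
```

It could be called as merge_by_sample . merged from the folder holding the files. Writing into a separate directory means the merged outputs cannot be picked up as inputs if the command is run a second time, and sort -u is used rather than plain uniq so that grouping does not depend on the order in which the shell lists the files.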

If you can, ask whoever generates the fastq files to run bcl2fastq with the --no-lane-splitting option, so that each sample is not split by lane in the first place.

written 2 days ago by swbarnes27.8k
2 days ago by
Asaf8.0k
Israel
Asaf8.0k wrote:

In Nextflow you can do:

def getLibraryId( prefix ){
  // Return the ID number, you can change for other file formats, here it just takes the first part before "_"
  prefix.split("_")[0]
}

// Gather the pairs of R1/R2 according to sample ID
Channel
     .fromFilePairs(params.datadir + '/**R{1,2}*.fastq.gz', flat: true)
     .map { prefix, file1, file2 -> tuple(getLibraryId(prefix), file1, file2) }
     .groupTuple().set{ files_channel }

Now files_channel will contain tuples of the sample ID, a list of all its R1 files, and a list of all its R2 files. See more here: https://nextflow-io.github.io/patterns/index.html#_process_outputs_into_groups (that example covers single-end data only, though).

written 2 days ago by Asaf8.0k