Question

Merging compressed fastq files based on conditions defined in a csv file

2.2 years ago

D ▴ 10

Hello everybody,

I have a question that is related to, but somewhat different from, a topic addressed in: Post not found

I tried Paul's bash script from the page indicated above (fastq_lane_merging.sh), adapted to my filename layout as follows:

#!/bin/bash

# Collect the unique third "_"-separated field of every fastq.gz filename
for i in $(find ./ -type f -name "*.fastq.gz" | while read -r F; do basename "$F" | cut -d "_" -f3; done | sort | uniq)
do
    echo "Merging R1"
    cat *_*_"$i"_1.fastq.gz > "$i"_AL3936_R1.fastq.gz
    echo "Merging R2"
    cat *_*_"$i"_2.fastq.gz > "$i"_AL3936_R2.fastq.gz
done

but it does not work.

Let me briefly explain my problem:

I'm new to bioinformatics, so apologies for the basic question. In a shotgun analysis, each sample was sequenced with a paired-end strategy across a different number of lanes. The sequencing center did not use the --no-lane-splitting option of Illumina's bcl2fastq program, which would have produced a single file per sample automatically. As a consequence, the raw data consist of the following files, in a folder called raw data:

H3CH3DRXX_1_109UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_1_109UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_1_97UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_2_109UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_2_109UDI-idt-UMI_2.fastq.gz
H3CH3DRXX_2_97UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_2_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_1_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_1_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_1_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_1_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_2_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_2_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_2_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_2_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_3_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_3_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_3_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_3_97UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_4_109UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_4_109UDI-idt-UMI_2.fastq.gz
HGTVGDSXX_4_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_4_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_1_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_1_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_2_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_2_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_3_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_3_97UDI-idt-UMI_2.fastq.gz
HGV35DSXX_4_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_4_97UDI-idt-UMI_2.fastq.gz
HGV52DSXX_1_97UDI-idt-UMI_1.fastq.gz
HGV52DSXX_1_97UDI-idt-UMI_2.fastq.gz

Using bash, I would like to use the file shown below (match_ids.csv), which links each sample name code (starting with "AL") to the different flowcell/multiplex codes (starting with the letter "H") that appear in the fastq.gz filenames above:

match_ids.csv file:

Left column: sample ID. Right column: the string from the original filename that belongs to the sample in the left column.

AL3936  H3CH3DRXX_1_97UDI-idt-UMI
AL3936  H3CH3DRXX_2_97UDI-idt-UMI
AL3936  HGTVGDSXX_1_97UDI-idt-UMI
AL3936  HGTVGDSXX_2_97UDI-idt-UMI
AL3936  HGTVGDSXX_3_97UDI-idt-UMI
AL3936  HGTVGDSXX_4_97UDI-idt-UMI
AL3936  HGV35DSXX_1_97UDI-idt-UMI
AL3936  HGV35DSXX_2_97UDI-idt-UMI
AL3936  HGV35DSXX_3_97UDI-idt-UMI
AL3936  HGV35DSXX_4_97UDI-idt-UMI
AL3936  HGV52DSXX_1_97UDI-idt-UMI
AL3936  HGV52DSXX_2_97UDI-idt-UMI
AL3936  HGV52DSXX_3_97UDI-idt-UMI
AL3936  HGV52DSXX_4_97UDI-idt-UMI
AL3937  H3CH3DRXX_1_109UDI-idt-UMI
AL3937  H3CH3DRXX_2_109UDI-idt-UMI
AL3937  HGTVGDSXX_1_109UDI-idt-UMI
AL3937  HGTVGDSXX_2_109UDI-idt-UMI
AL3937  HGTVGDSXX_3_109UDI-idt-UMI
AL3937  HGTVGDSXX_4_109UDI-idt-UMI

Therefore, I would like to create a bash script that merges all forward (and, separately, all reverse) fastq.gz files belonging to the same sample name (as defined in the csv file), with the merged fastq file named after the sample name code (starting with "AL").

To illustrate, here is an example of what I am after:

From the following forward fastq files:

H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz
H3CH3DRXX_2_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_1_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_2_97UDI-idt-UMI_1.fastq.gz
HGTVGDSXX_3_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_1_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_2_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_3_97UDI-idt-UMI_1.fastq.gz
HGV35DSXX_4_97UDI-idt-UMI_1.fastq.gz
HGV52DSXX_1_97UDI-idt-UMI_1.fastq.gz

Desired output:

A single fastq.gz file containing the 10 fastq.gz files above, merged, with the following filename:

AL3936_1.fastq.gz

Thanks in advance for your help and hints.


updated 14 months ago by Ram 43k • written 2.2 years ago by D ▴ 10

The files HGV35DSXX_2_97UDI-idt-UMI_1.fastq.gz, HGV35DSXX_3_97UDI-idt-UMI_1.fastq.gz and HGV52DSXX_1_97UDI-idt-UMI_1.fastq.gz belong to sample AL3936. Why aren't they included in the list of files to be merged to form AL3936_1.fastq.gz?

I assume that all files for a sample must be merged together, once for _1.fastq.gz and once for _2.fastq.gz. I suggest a three-step process:

1) parallel --plus --col-sep="\t" rename 's/{2}/{1}_{2}/' *.gz :::: match_ids.csv

This prepends the sample name to each file name, as listed in match_ids.csv (tab-separated). After this step, create a directory named "combined" in the same directory.

2) cut -f1 match_ids.csv | sort | uniq | while read line; do echo cat "$line"*_1.fastq.gz ">" combined/"$line"_1.fastq.gz;done

3) cut -f1 match_ids.csv | sort | uniq | while read line; do echo cat "$line"*_2.fastq.gz ">" combined/"$line"_2.fastq.gz;done

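This would store the combined files in the directory "combined". As written, steps 2 and 3 only print the cat commands (a dry run); remove the leading echo to actually write the files.

I strongly advise taking a backup of the files before running the scripts.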
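A minimal sketch of an alternative that skips the renaming step and concatenates straight from the mapping file. It assumes two whitespace-separated columns (tabs or spaces; a comma-separated file would need IFS=, on the read), and that the fastq files are named <second column>_1.fastq.gz and <second column>_2.fastq.gz. The function name merge_by_sample is made up for illustration. Concatenating gzip files with cat gives a valid multi-member gzip file, so nothing has to be decompressed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge per-lane fastq.gz files into one file per
# sample and read direction, driven by a two-column mapping file.
# Usage: merge_by_sample <mapping file> <input dir> <output dir>
merge_by_sample() {
    local csv=$1 indir=$2 outdir=$3
    local sample prefix read_no src cr
    cr=$(printf '\r')                      # to strip Windows line endings
    mkdir -p "$outdir"

    # Pass 1: remove stale outputs so a re-run does not append twice.
    while read -r sample prefix || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue
        rm -f "$outdir/${sample}_1.fastq.gz" "$outdir/${sample}_2.fastq.gz"
    done < "$csv"

    # Pass 2: append each listed file to its sample's output, in file order.
    while read -r sample prefix || [ -n "$sample" ]; do
        prefix=${prefix%"$cr"}
        [ -n "$sample" ] && [ -n "$prefix" ] || continue
        for read_no in 1 2; do
            src="$indir/${prefix}_${read_no}.fastq.gz"
            if [ -f "$src" ]; then
                cat "$src" >> "$outdir/${sample}_${read_no}.fastq.gz"
            else
                echo "warning: missing $src" >&2
            fi
        done
    done < "$csv"
}
```

With the csv and the fastq files in the current directory, this could be called as merge_by_sample match_ids.csv . combined. Entries listed in the csv but absent on disk are reported on stderr and skipped, which makes gaps easy to spot.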

2.2 years ago by cpad0112 21k

Thanks, cpad0112, for your help.

I've added the three fastq.gz files you pointed out, which I had forgotten to include. Thank you.

I installed parallel and rename on my Linux machine.

However, command 1) does not change match_ids.csv, and command 2) does not create any output.

I think this is because cut -f1 picks up the code starting with AL, but that code does not appear in the original filenames.

Thanks for your hints. I will keep thinking about it.

2.2 years ago by D ▴ 10

1) command does not change the match_ids.csv

The first command is not meant to change match_ids.csv. It looks for the files listed in the second column of match_ids.csv and prepends the matching sample ID from the first column to each file name. For this to work, match_ids.csv and all fastq files must be in the same folder.
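For reference, a sketch of the same renaming step in plain shell, without parallel or rename. One possible reason for step 1 doing nothing is that two unrelated programs are called rename (a Perl one that accepts s/.../.../ expressions and a util-linux one that does not), so this version avoids both. It assumes a two-column, whitespace-separated mapping file; the function name prefix_with_sample and its "run" argument are made up for illustration.

```shell
# Hypothetical helper: prefix each fastq.gz file with its sample ID.
# Usage: prefix_with_sample <mapping file> <dir> [run]
# Without "run" the mv commands are only printed (dry run).
prefix_with_sample() {
    local csv=$1 dir=$2 mode=${3:-dry}
    local sample prefix f cr
    cr=$(printf '\r')                      # to strip Windows line endings
    while read -r sample prefix || [ -n "$sample" ]; do
        prefix=${prefix%"$cr"}
        [ -n "$sample" ] && [ -n "$prefix" ] || continue
        for f in "$dir/${prefix}"_[12].fastq.gz; do
            [ -e "$f" ] || continue        # the glob matched nothing
            if [ "$mode" = run ]; then
                mv -- "$f" "$dir/${sample}_${f##*/}"
            else
                echo mv -- "$f" "$dir/${sample}_${f##*/}"
            fi
        done
    done < "$csv"
}
```

Running prefix_with_sample match_ids.csv . first shows what would be renamed; adding run as a third argument performs the renaming, after which H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz would become AL3936_H3CH3DRXX_1_97UDI-idt-UMI_1.fastq.gz.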

2) does not create any output.

Since step 1 didn't work, the second command cannot create output, because the file names do not match.

Run the tree command in the directory where both the csv and the fastq files are kept and post the result. In addition, is the field separator between the first and second columns a tab, a comma or a space?
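The separator question can be checked quickly. The sketch below inspects only the first line of the file and reports the first separator it finds, testing tab, then comma, then space; the function name detect_separator is made up for illustration.

```shell
# Hypothetical helper: report the separator used on the first line.
# Usage: detect_separator <mapping file>
detect_separator() {
    local first_line tab
    first_line=$(head -n 1 "$1")
    tab=$(printf '\t')
    case $first_line in
        *"$tab"*) echo tab ;;
        *,*)      echo comma ;;
        *" "*)    echo space ;;
        *)        echo unknown ;;
    esac
}
```

If detect_separator match_ids.csv prints anything other than tab, the cut -f1 calls in steps 2 and 3 return the whole line instead of the sample ID, because cut splits on tabs by default and passes lines without a tab through unchanged. Piping head -n 1 match_ids.csv through od -c shows the raw characters if the result is unclear.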

2.2 years ago by cpad0112 21k