Question: Concatenating fastq.gz files across lanes
0
volvicpellegrino (0) wrote 2.5 years ago:

Hi,

I have spent several hours trying to figure out the best approach to do this. It would have been quicker to do it manually, but I will need to do this again in the future.

I have 40 paired-end RNAseq samples that were read across 5 lanes. I therefore have 400 fastq.gz files that I would like to process in Kallisto. The file name structure is as follows:

string_laneID_sampleID_pairID.fastq.gz

'string' is the same for every file

I want to concatenate the 5 lane files for each of the 40 samples (keeping the two reads of each pair separate), rather than running Kallisto on 200 lane-level pairs of files. Is this the correct approach?

Can someone please advise on the best way to concatenate these files? I have some knowledge of Python and could write a bash script if someone could explain what each part means. Thank you

rna-seq • 9.6k views
modified 4 months ago by Satyajeet Khare (1.6k) • written 2.5 years ago by volvicpellegrino (0)
10
GenoMax (92k, United States) wrote 2.5 years ago:

Use cat to concatenate the files for each sample across lanes, keeping the R1 and R2 reads separate. I assume 'string' carries no meaning.

cat string_L001_sampleID_R1.fastq.gz string_L002_sampleID_R1.fastq.gz string_L003_sampleID_R1.fastq.gz string_L004_sampleID_R1.fastq.gz string_L005_sampleID_R1.fastq.gz > string_sampleID_R1.fastq.gz

cat string_L001_sampleID_R2.fastq.gz string_L002_sampleID_R2.fastq.gz string_L003_sampleID_R2.fastq.gz string_L004_sampleID_R2.fastq.gz string_L005_sampleID_R2.fastq.gz > string_sampleID_R2.fastq.gz

Since you know some bash scripting, this should be simple: select the right files for each sample and run the commands above.

In the future, ask the sequencing center to use the --no-lane-splitting option of Illumina's bcl2fastq program to get a single file per sample and read automatically.

modified 2.5 years ago • written 2.5 years ago by GenoMax (92k)
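
A quick sanity check for a merge like the one above is to compare read counts before and after. This is only a minimal sketch on throwaway demo data: the work directory, the tiny FASTQ records and the count_reads helper are all made up for illustration, and the file names follow the pattern from the question.

```shell
#!/bin/sh
# Demo data: five tiny lane files (one read each) for one hypothetical sample.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
for lane in L001 L002 L003 L004 L005; do
    printf '@read_%s\nACGT\n+\nIIII\n' "$lane" |
        gzip -c > "string_${lane}_sampID1_R1.fastq.gz"
done

# Hypothetical helper: a FASTQ record is 4 lines, so reads = lines / 4.
count_reads() {
    echo $(( $(gzip -dc "$@" | wc -l) / 4 ))
}

# Count reads in the lane files, merge them, then count again.
before=$(count_reads string_L00?_sampID1_R1.fastq.gz)
cat string_L00?_sampID1_R1.fastq.gz > string_sampID1_R1.fastq.gz
after=$(count_reads string_sampID1_R1.fastq.gz)

echo "reads in lane files: $before, reads in merged file: $after"
```

If the two numbers differ, something went wrong in the file selection (for example a lane was missed or matched twice).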

Thanks for your help - the trouble is I have no idea how to specify the right files in a for loop. I understand the principle; it's how you construct the code that is the issue. I guess I might have more luck with Python. I suppose I could try an os.walk across the directory, check each file for the sampleID in a for loop, and then somehow execute a shell script to concatenate all files with a given sampleID into a new file.

written 2.5 years ago by volvicpellegrino (0)
2

Let us do a very simple two-step approach (@Pierre has a fancier one-liner).

Step 1: Grab the unique sample IDs and write them to a file

ls -1 *R1*.gz | awk -F '_' '{print $3}' | sort | uniq > ID

The ID file should contain

sampID1
sampID2
sampID3
sampID4
sampID5

Step 2: Walk through the ID file one record at a time to create the command line you need for each cat command. This can be done in more complex ways but I am using a command line that should be easy to understand.

for i in `cat ./ID`; do echo cat string_L001_${i}_R1.fastq.gz string_L002_${i}_R1.fastq.gz string_L003_${i}_R1.fastq.gz string_L004_${i}_R1.fastq.gz string_L005_${i}_R1.fastq.gz \> ${i}_R1.fastq.gz; done

This should produce the output below. Once everything looks good, remove the word echo and the backslash before > to actually execute the commands (with the backslash left in, > would be passed to cat as a file name instead of redirecting the output). Repeat for the R2 files.

cat string_L001_sampID1_R1.fastq.gz string_L002_sampID1_R1.fastq.gz string_L003_sampID1_R1.fastq.gz string_L004_sampID1_R1.fastq.gz string_L005_sampID1_R1.fastq.gz > sampID1_R1.fastq.gz
cat string_L001_sampID2_R1.fastq.gz string_L002_sampID2_R1.fastq.gz string_L003_sampID2_R1.fastq.gz string_L004_sampID2_R1.fastq.gz string_L005_sampID2_R1.fastq.gz > sampID2_R1.fastq.gz
modified 2.5 years ago • written 2.5 years ago by GenoMax (92k)
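
For reference, the two steps can also be combined into one small script that handles R1 and R2 together and lets a glob pick up the lanes. This is a sketch, not a drop-in solution: it assumes the naming from the question, sample IDs without underscores, and lanes named L001 to L005. The demo files and the work directory are invented so that the script runs on its own.

```shell
#!/bin/sh
# Demo data: 2 hypothetical samples x 5 lanes x 2 reads, one record per file.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
for s in sampID1 sampID2; do
    for lane in L001 L002 L003 L004 L005; do
        for r in R1 R2; do
            printf '@%s_%s_%s\nACGT\n+\nIIII\n' "$s" "$lane" "$r" |
                gzip -c > "string_${lane}_${s}_${r}.fastq.gz"
        done
    done
done

# Step 1: unique sample IDs (third underscore-separated field).
printf '%s\n' string_*_R1.fastq.gz | awk -F '_' '{print $3}' | sort -u > ID

# Step 2: for each ID and each read, concatenate the lane files.
# The glob expands in sorted order, so lanes are in the same order for R1 and R2.
while read -r id; do
    for r in R1 R2; do
        cat string_L00?_"${id}"_"${r}".fastq.gz > "${id}_${r}.fastq.gz"
    done
done < ID

ls -1 samp*_R?.fastq.gz
```

Keeping the lane order identical for R1 and R2 matters because paired-end tools match reads by position in the two files.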

Thank you so much to both of you! That makes total sense.

written 2.5 years ago by volvicpellegrino (0)

I've been using this method for a while. I've always been dubious of it, since cat-ing multiple .gz files like this seems sketchy, but it always seems to work.

written 2.5 years ago by steve (2.6k)
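
On why this works: the gzip format (RFC 1952) defines a file as a series of compressed members, so concatenating .gz files gives a valid .gz file that decompresses to the concatenated contents. A quick check anyone can run, using made-up file names in a temporary directory:

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1

# Two independently compressed files.
printf 'lane one\n' | gzip -c > a.gz
printf 'lane two\n' | gzip -c > b.gz

# Plain byte-level concatenation, no recompression.
cat a.gz b.gz > merged.gz

# gzip -t tests the integrity of the compressed file.
if gzip -t merged.gz; then status=ok; else status=corrupt; fi

# Decompressing yields both payloads, in order.
merged_text=$(gzip -dc merged.gz)
echo "integrity: $status"
printf '%s\n' "$merged_text"
```

The usual caveat is that a reader which only handles the first gzip member would silently see just the first lane, so comparing read counts after merging is still worthwhile.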
5
Pierre Lindenbaum (131k, France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 2.5 years ago:
find ./ -type f -name "*.fastq.gz" | while read F; do basename $F | cut -d _ -f 2,3  ; done | sort | uniq | while read P; do find ./ -type f -name "string_${P}_*.fastq.gz"  -exec cat '{}' ';'  > ${P}.merged.fq.gz ; done
written 2.5 years ago by Pierre Lindenbaum (131k)

Thanks very much for this - could you explain how it works or point me in the right direction to understand?

written 2.5 years ago by volvicpellegrino (0)
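
A step-by-step sketch of the same idea, written out on demo files so each stage can be inspected. It differs from the one-liner in two ways, both of which are assumptions on my part. First, with names of the form string_laneID_sampleID_pairID.fastq.gz, fields 2 and 3 are the lane and the sample, so this version keys on fields 3 and 4 (sample and read) in order to merge across lanes. Second, find does not promise any particular output order, so the file list is sorted to keep the lanes in the same order for R1 and R2. The work directory and file contents are invented for the demo.

```shell
#!/bin/sh
# Demo data: one hypothetical sample, two lanes, both reads.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
for lane in L001 L002; do
    for r in R1 R2; do
        printf '@sampA_%s_%s\nACGT\n+\nIIII\n' "$lane" "$r" |
            gzip -c > "string_${lane}_sampA_${r}.fastq.gz"
    done
done

# Stage 1: list the files, strip directory and extension,
# keep fields 3 and 4 ("sample_read"), and de-duplicate.
keys=$(find . -type f -name 'string_*.fastq.gz' |
       while read -r f; do basename "$f" .fastq.gz | cut -d _ -f 3,4; done |
       sort -u)
printf 'keys:\n%s\n' "$keys"

# Stage 2: for each key, concatenate the matching files in sorted order.
printf '%s\n' "$keys" | while read -r key; do
    find . -type f -name "string_*_${key}.fastq.gz" | sort |
        while read -r f; do cat "$f"; done > "${key}.merged.fq.gz"
done

ls -1 ./*.merged.fq.gz
```

The merged files end in .fq.gz rather than .fastq.gz, so they are not picked up again if the script is rerun in the same directory.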

Thanks again, so breaking it down it looks like it lists all files in the directory, truncates the directory values, extracts the 2nd and 3rd fields delimited by '_', sorts them (alphanumerically?), filters matching occurrences, and then runs a concatenate command for every file matching each line of the text to output to a file containing the sampleID.merged... Is that right? What does the semicolon ; denote in the cat command?

written 2.5 years ago by volvicpellegrino (0)
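
On the semicolon: it is not part of cat. It marks the end of the command given to find's -exec, and it is quoted so that the shell does not treat it as its own command separator. With ';' the command is run once per file found. POSIX find also accepts '+' in that position, which passes many files to a single invocation. A small demo with throwaway files (names made up):

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
touch a.txt b.txt c.txt

# ';' form: echo is run once per file, so three lines come out.
per_file=$(find . -type f -name '*.txt' -exec echo run: '{}' ';' |
           wc -l | tr -d '[:space:]')

# '+' form: echo is run once with all three files, so one line comes out.
batched=$(find . -type f -name '*.txt' -exec echo run: '{}' + |
          wc -l | tr -d '[:space:]')

echo "invocations with ';': $per_file"
echo "invocations with '+': $batched"
```

For cat the result is the same either way, since the output of every invocation ends up in the same redirected file.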
5
Paul (1.4k, European Union) wrote 2.5 years ago:

If your FASTQ files always have the suffix *_L00*_R*_001.fastq.gz, which is standard Illumina output, this script should work. It concatenates the FASTQ files from all lanes, separately for R1 and R2:

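#!/bin/bash

# Strip the 21-character suffix (_L00x_Ry_001.fastq.gz) from each file name
# to get the unique sample prefixes, then merge the lanes for each prefix.
for i in $(find ./ -type f -name "*.fastq.gz" | while read -r F; do basename "$F" | rev | cut -c 22- | rev; done | sort | uniq)
do
    echo "Merging R1"
    cat "$i"_L00*_R1_001.fastq.gz > "$i"_ME_L001_R1_001.fastq.gz

    echo "Merging R2"
    cat "$i"_L00*_R2_001.fastq.gz > "$i"_ME_L001_R2_001.fastq.gz
done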

Save the script in a text editor as fastq_lane_merging.sh and make it executable:

chmod +x fastq_lane_merging.sh

Then run it in the folder containing your FASTQ files: ./fastq_lane_merging.sh

If you want to understand what the script does, I would recommend running parts of the one-liner in your terminal and seeing what happens.

Note: it would be best to add a check to this script that verifies you have the same number of FASTQ files for each lane and for both reads.

modified 24 months ago • written 2.5 years ago by Paul (1.4k)
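
The check mentioned in the note could look something like this. It is only a sketch under the same naming assumption (prefix_L00x_Ry_001.fastq.gz, prefixes without spaces); the demo files and the count_files helper are made up, and only prefixes that have at least one R1 file are examined.

```shell
#!/bin/sh
# Demo data: sample A is complete, sample B is missing R2 for lane 2.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
touch A_S1_L001_R1_001.fastq.gz A_S1_L001_R2_001.fastq.gz \
      A_S1_L002_R1_001.fastq.gz A_S1_L002_R2_001.fastq.gz \
      B_S2_L001_R1_001.fastq.gz B_S2_L001_R2_001.fastq.gz \
      B_S2_L002_R1_001.fastq.gz

# Hypothetical helper: count how many of the arguments are existing files
# (an unmatched glob stays literal, so it counts as zero).
count_files() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Compare the number of R1 and R2 lane files for every sample prefix.
problems=""
for prefix in $(printf '%s\n' *_L00?_R1_001.fastq.gz |
                sed 's/_L00[0-9]_R1_001\.fastq\.gz$//' | sort -u); do
    n1=$(count_files "$prefix"_L00?_R1_001.fastq.gz)
    n2=$(count_files "$prefix"_L00?_R2_001.fastq.gz)
    if [ "$n1" -ne "$n2" ]; then
        echo "WARNING: $prefix has $n1 R1 files but $n2 R2 files"
        problems="$problems $prefix"
    fi
done

[ -z "$problems" ] && echo "all samples have matching R1/R2 counts"
```

In a real script one would probably stop before merging when the problems list is not empty.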

Hi,

This has been a really useful script! Just wanted to know if there is a way to change it so that I can run when my fastq files are in different sub-directories? Currently it only works if they are in the same one.

Cheers!

written 19 months ago by unawaz (50)

What about creating links to the FASTQ files in one directory, or using the output of the find command?

written 9 months ago by Paul (1.4k)
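
A sketch of the find-based variant for files spread over sub-directories. It assumes the same Illumina-style names, sample names without spaces, and that R1 and R2 of a lane sit in the same directory (so sorting the paths gives the same lane order for both reads). The directory layout, the merged output folder and the file contents are invented for the demo.

```shell
#!/bin/sh
# Demo data: one hypothetical sample with each lane in its own sub-directory.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
mkdir -p run/lane1 run/lane2 merged
for n in 1 2; do
    for r in R1 R2; do
        printf '@L00%s_%s\nACGT\n+\nIIII\n' "$n" "$r" |
            gzip -c > "run/lane${n}/sampA_L00${n}_${r}_001.fastq.gz"
    done
done

# Build the list of unique "sample read" pairs from the base names,
# then merge the matching files found anywhere below the current directory.
find . -type f -name '*_L00?_R?_001.fastq.gz' |
    sed 's|.*/||; s/_L00[0-9]_\(R[12]\)_001\.fastq\.gz$/ \1/' | sort -u |
while read -r sample mate; do
    find . -type f -name "${sample}_L00?_${mate}_001.fastq.gz" | sort |
        while read -r f; do cat "$f"; done > "merged/${sample}_${mate}.fastq.gz"
done

ls -1 merged
```

The merged names do not end in _001.fastq.gz, so a second run does not feed the merged files back in.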

Hi Paul, I am not good at programming. I am trying to use your code but I am not very successful. I have the following files, but they end in _1.fq.gz or _2.fq.gz:

V300066187_L4_B5RDBATtnuRAAAAA-407_1.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-407_2.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-405_1.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-405_2.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-406_1.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-406_2.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-408_1.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-408_2.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-407_1.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-407_2.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-405_1.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-405_2.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-408_1.fq.gz
V300068047_L2_B5RDBATtnuRAAAAA-408_2.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-406_1.fq.gz
V300066187_L4_B5RDBATtnuRAAAAA-406_2.fq.gz

Could you help me with the code? I tried using the cat loop, but when I used the concatenated files for assembly with MaSuRCA, it didn't detect any sequence. I ran FastQC on them and they are fine... I just want to have them in one file for forward and one for reverse, _1 and _2 respectively. And this has been very painful...

Thanks

written 15 days ago by ja569116 (0)
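
For names like these, a per-sample merge could be sketched as below. This assumes the third underscore-separated field (for example B5RDBATtnuRAAAAA-407) identifies the sample and that the last field is the read number. The merged output folder is my own choice, and the demo recreates only two of the listed file pairs with made-up contents.

```shell
#!/bin/sh
# Demo data: sample ...-407 sequenced on two flowcell/lane combinations.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
for fc_lane in V300066187_L4 V300068047_L2; do
    for r in 1 2; do
        printf '@%s/%s\nACGT\n+\nIIII\n' "$fc_lane" "$r" |
            gzip -c > "${fc_lane}_B5RDBATtnuRAAAAA-407_${r}.fq.gz"
    done
done

# Write merged files to a separate folder so they never match the input glob.
mkdir -p merged
for sample in $(printf '%s\n' *_1.fq.gz | cut -d _ -f 3 | sort -u); do
    for r in 1 2; do
        # The glob expands in sorted order, identical for _1 and _2.
        cat ./*_"${sample}_${r}".fq.gz > "merged/${sample}_${r}.fq.gz"
    done
done

ls -1 merged
```

If the goal is instead one pooled forward file and one pooled reverse file for all samples, the same idea with ./*_1.fq.gz and ./*_2.fq.gz as inputs should do, since both globs expand in the same order.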
0
swbarnes2 (9.2k, United States) wrote 2.5 years ago:

Depending on what you are using to align, you might be able to cat the samples on the fly, like

cat *.sample1*.fastq.gz | STAR ...

So if you already have a loop that is handling the variable for the sample name, just use that to build your command.

written 2.5 years ago by swbarnes2 (9.2k)
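
A minimal sketch of the streaming idea, with wc -l standing in for a real tool that reads FASTQ on standard input. Whether a particular aligner can read from a pipe, and with which options, depends on the tool, so check its documentation first. The file names and the work directory are made up for the demo.

```shell
#!/bin/sh
# Demo data: two lane files for a hypothetical sample1, one read each.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
for lane in L001 L002; do
    printf '@%s\nACGT\n+\nIIII\n' "$lane" |
        gzip -c > "string_${lane}_sample1_R1.fastq.gz"
done

# Stream all lanes of sample1 R1 to a consumer without writing a merged file.
# 'wc -l' is only a stand-in for the downstream program.
lines=$(cat ./*sample1*_R1.fastq.gz | gzip -dc | wc -l | tr -d '[:space:]')
reads=$((lines / 4))

echo "streamed $reads reads from sample1 R1"
```

For paired-end data the R1 and R2 streams have to be supplied separately and in the same lane order.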
0
Satyajeet Khare (1.6k, Pune, India) wrote 4 months ago:
cat *S1*R1* >> S1_R1.fastq.gz

where S1 is the sample name in the sample sheet. This can also be done in a loop over the sample sheet:

folder="/Your_directory"; for value in $(cat ${folder}/sample_sheet.csv | tail -n +24 | tr -s ' ' | cut -d, -f1); do   cat ${folder}/Output/*S${value}*R1* >> ${folder}/Output/S${value}_R1.fastq.gz ; cat ${folder}/Output/*S${value}*R2* >> ${folder}/Output/S${value}_R2.fastq.gz  ; done

In this command, tail -n +24 starts at line 24 of the sample sheet, skipping the first 23 lines, which contain peripheral information. Take a look at your own sample sheet and adjust the number accordingly.

modified 4 months ago • written 4 months ago by Satyajeet Khare (1.6k)
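
Two things to watch with patterns like the ones above: a loose glob such as *S1*R1* also matches samples S10, S11 and so on, and >> appends, so running the command twice doubles the reads. A small demo of both, assuming Illumina-style names of the form name_S1_L001_R1_001.fastq.gz; the file names, contents and the merged folder are made up.

```shell
#!/bin/sh
# Demo data: one file for sample S1 and one for sample S10.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work" || exit 1
printf 'one\n' | gzip -c > sample_S1_L001_R1_001.fastq.gz
printf 'ten\n' | gzip -c > sample_S10_L001_R1_001.fastq.gz

# The loose pattern picks up both samples; anchoring on the underscores does not.
loose=$(printf '%s\n' *S1*R1* | wc -l | tr -d '[:space:]')
tight=$(printf '%s\n' *_S1_*_R1_* | wc -l | tr -d '[:space:]')
echo "loose pattern matches $loose files, tight pattern matches $tight"

# Writing with '>' into a separate folder makes reruns harmless:
# the output is overwritten, not appended to, and never matches the input glob.
mkdir -p merged
cat ./*_S1_*_R1_* > merged/S1_R1.fastq.gz
cat ./*_S1_*_R1_* > merged/S1_R1.fastq.gz
content=$(gzip -dc merged/S1_R1.fastq.gz)
echo "merged content: $content"
```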