Concatenating fastq.gz files across lanes
5
4
Entering edit mode
5.9 years ago

Hi,

I have spent several hours trying to figure out the best approach to do this. It would have been quicker to manually do it, but I will need to do this in future.

I have 40 paired-end RNAseq samples that were read across 5 lanes. I therefore have 400 fastq.gz files that I would like to process in Kallisto. The file name structure is as follows:

string_laneID_sampleID_pairID.fastq.gz

'string' is the same for every file

I want to concatenate the 5 lane files for each of the 40 samples, rather than running Kallisto for 200 paired end samples (is this the correct approach?).

Can someone please advise on the best way to concatenate these files? I have some knowledge of python and could do a bash script if someone could explain what each part means. Thank you

RNA-Seq • 40k views
ADD COMMENT
24
Entering edit mode
5.9 years ago
GenoMax 141k

cat files per sample across lanes for R1/R2 reads separately. I assume string has no implication.

cat string_L001_sampleID_R1.fastq.gz string_L002_sampleID_R1.fastq.gz string_L003_sampleID_R1.fastq.gz string_L004_sampleID_R1.fastq.gz string_L005_sampleID_R1.fastq.gz > string_sampleID_R1.fastq.gz

cat string_L001_sampleID_R2.fastq.gz string_L002_sampleID_R2.fastq.gz string_L003_sampleID_R2.fastq.gz string_L004_sampleID_R2.fastq.gz string_L005_sampleID_R2.fastq.gz > string_sampleID_R2.fastq.gz

Since you know bash scripting this should be simple to do for you by selecting the right files and doing the above.

In future ask the sequencing center to use --no-lane-splitting option for Illumina's bcl2fastq program to get a single file per sample automatically.

ADD COMMENT
0
Entering edit mode

Thanks for your help - the trouble is I have no idea how to specify the right files in a for loop. I understand the principle, its how you construct the code that is the issue. I guess I might have more luck with python. I suppose I could try an os.walk across the directory and a for loop check for sampleID and then somehow execute a shell script to concatenate all files with a given sampleID into a new file.

ADD REPLY
3
Entering edit mode

Let us do a very simple two step approach (@Pierre has a fancier one liner).

Step 1: Grab the unique sample ID's in a file

ls -1 *R1*.gz | awk -F '_' '{print $3}' | sort | uniq > ID

ID file should have

sampID1
sampID2
sampID3
sampID4
sampID5

Step 2: Walk through the ID file one record at a time to create the command line you need for each cat command. This can be done in more complex ways but I am using a command line that should be easy to understand.

for i in `cat ./ID`; do echo cat string_L001_$i\_R1.fastq.gz string_L002_$i\_R1.fastq.gz string_L003_$i\_R1.fastq.gz string_L004_$i\_R1.fastq.gz \> $i\_R1.fastq.gz; done

should get you output below (remove the word echo when everything looks good to actually execute the commands, repeat for R2 files.).

cat string_L001_sampID1_R1.fastq.gz string_L002_sampID1_R1.fastq.gz string_L003_sampID1_R1.fastq.gz string_L004_sampID1_R1.fastq.gz > sampID1_R1.fastq.gz
cat string_L001_sampID2_R1.fastq.gz string_L002_sampID2_R1.fastq.gz string_L003_sampID2_R1.fastq.gz string_L004_sampID2_R1.fastq.gz > sampID2_R1.fastq.gz
ADD REPLY
0
Entering edit mode

Thank you so much to both of you! That makes total sense.

ADD REPLY
0
Entering edit mode

Thank you so much for this explanation and scripts. used it and worked perfectly fine for me! Special shoutout for giving a heads up as "remove the word echo when everything looks good to actually execute the commands"

ADD REPLY
0
Entering edit mode

been using this method for a while, always been dubious of it since cating multiple .gz like this seems sketchy but it always seems to work..

ADD REPLY
11
Entering edit mode
5.9 years ago
Paul ★ 1.5k

If your FASTQ files have always suffix *_L00*_R*_001.fastq.gz , wich is standard Illumina output. This script should works - it concatenate fastq files according all lanes separate by R1 + R2:

#!/bin/bash

for i in $(find ./ -type f -name "*.fastq.gz" | while read F; do basename $F | rev | cut -c 22- | rev; done | sort | uniq)

    do echo "Merging R1"

cat "$i"_L00*_R1_001.fastq.gz > "$i"_ME_L001_R1_001.fastq.gz

       echo "Merging R2"

cat "$i"_L00*_R2_001.fastq.gz > "$i"_ME_L001_R2_001.fastq.gz

done;

Save script to text editor fastq_lane_merging.sh And make executable:

chmod +x fastq_lane_merging.sh

And run in folder with your FASTQ files like: ./fastq_lane_merging.sh

If you want understand what script do - I would recommend to try play with part of one-liner in your terminal and see what is happening.

Note: Best would be to add some check point to this script, if you have the same count of FASTQ files for each lane and both reads.

ADD COMMENT
0
Entering edit mode

Hi,

This has been a really useful script! Just wanted to know if there is a way to change it so that I can run when my fastq files are in different sub-directories? Currently it only works if they are in the same one.

Cheers!

ADD REPLY
0
Entering edit mode

What about to create link of fastq files or use output of find command?

ADD REPLY
0
Entering edit mode

Hi Paul, I am not good at programming. I at trying to use your code but I am not very successful. I have the following files but they end in _1 or _2.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-407_1.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-407_2.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-405_1.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-405_2.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-406_1.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-406_2.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-408_1.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-408_2.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-407_1.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-407_2.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-405_1.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-405_2.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-408_1.fq.gz V300068047_L2_B5RDBATtnuRAAAAA-408_2.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-406_1.fq.gz V300066187_L4_B5RDBATtnuRAAAAA-406_2.fq.gz

Could you help me with the code? I tried using the cat loop but when I used the concatenated for assembly with MaSurca, it didn't detect any sequence. I did fastqc them and they are fine... I just want to have them in one file for forward and one for reverse, _1 and _2 respectively. And this has been very painful...

Thanks;

ADD REPLY
0
Entering edit mode

Hi of course,

please modify script to:

#!/bin/bash

for i in $(find ./ -type f -name "*.fastq.gz" | while read F; do basename $F | rev | cut -c 33- | rev; done | sort | uniq)

    do echo "Merging R1"

cat "$i"_L*_B5RDBATtnuRAAAAA-405_1.fq.gz> "$i"_ME_L001_R1_001.fastq.gz

       echo "Merging R2"

cat "$i"_L*_B5RDBATtnuRAAAAA-405_2.fq.gz > "$i"_ME_L001_R2_001.fastq.gz

done;

And let me know if its works.

Paul.

ADD REPLY
0
Entering edit mode

Hi, Thank you for the scripts. But how to cat the files when they are present in different sub directories? The format is as follows, many folders like: sample1_lane1 sample1_lane2 sample1_lane3 sample1_lane4 Each of which has a R1 and R2 file. And in such a manner there are many samples. Can you help me with how to modify the command in such case?

Thanks in advance.

ADD REPLY
0
Entering edit mode

Hi, Thank you for the scripts. But how to cat the files when they are present in different sub directories? The format is as follows: sample1_lane1 sample1_lane2 sample1_lane3 sample1_lane4 Each of which has a R1 and R2 file. And in such a manner there are many samples. Can you help me with how to modify the command in such case?

Thanks in advance.

ADD REPLY
0
Entering edit mode

Use the following with extreme caution. While there will be more clever ways I am showing you one where you can cp copy (or preferably mv) all files into the current directory and then you can use @Paul's script or others in this thread.

This is the kind of directory structure I assume you have

$ ls -1R
Sample1_lane1
Sample1_lane2
Sample2_lane1
Sample2_lane2

./Sample1_lane1:
Sample1_L001_R1.fq.gz
Sample1_L001_R2.fq.gz

./Sample1_lane2:
Sample1_L002_R1.fq.gz
Sample1_L002_R2.fq.gz

./Sample2_lane1:
Sample2_L001_R1.fq.gz
Sample2_L001_R2.fq.gz

./Sample2_lane2:
Sample2_L002_R1.fq.gz
Sample2_L002_R2.fq.gz

Using

for i in $(find . -type d -name "Sample*"); do for j in $(find ${i} -type f -name "*.fq.gz"); do cp ${j} .; done; done

will cp (you can change the cp to mv) all fastq files to the top directory. You can also move them anywhere else by providing a new relative/full path.

ADD REPLY
0
Entering edit mode

Hi genomax, Thank you very much, this is very useful for me. Apologies but I have another query for the second part. This script worked well, but while trying to cat the files I was using the scripts given by you earlier.

My file name format is: IM1_Sxx_L001_R1_001.fastq.gz (similarly for L2, L3, L4), next sample as IM2_Syy_L001_R1_001.fastq.gz and so on.

While making a ID file from the first line of your code I modified it as follows: ls -1 R1.gz | awk -F '_' '{print $1}' | sort | uniq > ID ####made it $1 so as to save IM1, IM2 in the ID file.

However I am unable to modify the second half of your code, primarily because there is no common string in my code. So does that mean I will have to write it several times for each file?

I did try using Paul's code and it works well but I don't understand the -cut 22- in: for i in $(find ./ -type f -name "*.fastq.gz" | while read F; do basename $F | rev | cut -c 22- | rev; done | sort | uniq)

What is it that we are trying to cut here?

Any help will be great.

Thank you

ADD REPLY
0
Entering edit mode

@genomax, thank you very much. I was able to modify and use your cat command. thanks a million

ADD REPLY
0
Entering edit mode

@pdhrati02 Would you type the script after the modifications you made? Thanks a lot.

ADD REPLY
0
Entering edit mode

Hi, Sure, I used the following:

ls -1 *R1*.gz | awk -F '_' '{print $1"_"$2}' | sort | uniq > ID

This above command made a file with all sample names in it and then:

for i in `cat ./ID`; 

do cat $i\_L001\_R1_001.fastq.gz $i\_L002\_R1_001.fastq.gz $i\_L003\_R1_001.fastq.gz $i\_L004\_R1_001.fastq.gz > $i\_R1_cat.fastq.gz; 

done;

Hope this helps

ADD REPLY
7
Entering edit mode
5.9 years ago
find ./ -type f -name "*.fastq.gz" | while read F; do basename $F | cut -d _ -f 2,3  ; done | sort | uniq | while read P; do find ./ -type f -name "string_${P}_*.fastq.gz"  -exec cat '{}' ';'  > ${P}.merged.fq.gz ; done
ADD COMMENT
0
Entering edit mode

Thanks very much for this - could you explain how it works or point me in the right direction to understand?

ADD REPLY
0
Entering edit mode

Thanks again, so breaking it down it looks like it lists all files in the directory, truncates the directory values, extracts the 2nd and 3rd fields delimited by '_', sorts them (alphanumerically?), filters matching occurrences, and then runs a concatenate command for every file matching each line of the text to output to a file containing the sampleID.merged... Is that right? What does the semicolon ; denote in the cat command?

ADD REPLY
0
Entering edit mode
5.9 years ago

Depending on what you are using to align, you might be able to cat the samples on the fly, like

cat *.sample1*.fastq.gz | STAR ...

So if you already have a loop that is handling the variable for the sample name, just use that to build your command.

ADD COMMENT
0
Entering edit mode
3.7 years ago
Satyajeet Khare ★ 1.6k
cat *S1*R1* >> S1_R1.fastq.gz

Where S1 is sample name in sample sheet. This can also be done

folder="/Your_directory"; for value in $(cat ${folder}/sample_sheet.csv | tail -n +24 | tr -s ' ' | cut -d, -f1); do   cat ${folder}/Output/*S${value}*R1* >> ${folder}/Output/S${value}_R1.fastq.gz ; cat ${folder}/Output/*S${value}*R2* >> ${folder}/Output/S${value}_R2.fastq.gz  ; done

In this command tail -n +24 will chop top rows of the sample sheet that contain peripheral information. Take a look at your sample sheet and decide the number.

ADD COMMENT

Login before adding your answer.

Traffic: 1816 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6