Question: Concatenating fastq.gz files across lanes
0
12 months ago by
volvicpellegrino0 wrote:

Hi,

I have spent several hours trying to work out the best way to do this. Doing it manually would have been quicker, but I will need to do it again in future.

I have 40 paired-end RNA-seq samples that were each sequenced across 5 lanes, so I have 400 fastq.gz files (40 samples × 5 lanes × 2 reads) that I would like to process with Kallisto. The file name structure is as follows:

string_laneID_sampleID_pairID.fastq.gz

'string' is the same for every file

I want to concatenate the 5 lane files for each of the 40 samples, rather than running Kallisto on 200 separate pairs of files (is this the correct approach?).

Can someone please advise on the best way to concatenate these files? I have some knowledge of Python and could write a bash script if someone could explain what each part means. Thank you.

rna-seq • 2.3k views
modified 12 months ago by swbarnes25.6k • written 12 months ago by volvicpellegrino0
8
12 months ago by
genomax67k
United States
genomax67k wrote:

Use cat to concatenate the files for each sample across lanes, keeping the R1 and R2 reads separate. I assume 'string' carries no meaning.

cat string_L001_sampleID_R1.fastq.gz string_L002_sampleID_R1.fastq.gz string_L003_sampleID_R1.fastq.gz string_L004_sampleID_R1.fastq.gz string_L005_sampleID_R1.fastq.gz > string_sampleID_R1.fastq.gz

cat string_L001_sampleID_R2.fastq.gz string_L002_sampleID_R2.fastq.gz string_L003_sampleID_R2.fastq.gz string_L004_sampleID_R2.fastq.gz string_L005_sampleID_R2.fastq.gz > string_sampleID_R2.fastq.gz

Since you know some bash scripting, this should be simple: select the right files for each sample and run the commands above.

In future, ask the sequencing center to use the --no-lane-splitting option of Illumina's bcl2fastq program to get a single file per sample (and read) automatically.

modified 12 months ago • written 12 months ago by genomax67k

Thanks for your help - the trouble is I have no idea how to specify the right files in a for loop. I understand the principle; it's how you construct the code that is the issue. I guess I might have more luck with Python. I suppose I could try an os.walk across the directory, check for each sampleID in a for loop, and then somehow execute a shell script to concatenate all files with a given sampleID into a new file.

written 12 months ago by volvicpellegrino0
2

Let us do a very simple two step approach (@Pierre has a fancier one liner).

Step 1: Write the unique sample IDs to a file

ls -1 *R1*.gz | awk -F '_' '{print $3}' | sort | uniq > ID

The ID file should contain

sampID1
sampID2
sampID3
sampID4
sampID5

Step 2: Walk through the ID file one record at a time to create the command line you need for each cat command. This can be done in more complex ways but I am using a command line that should be easy to understand.

for i in $(cat ./ID); do echo cat string_L001_${i}_R1.fastq.gz string_L002_${i}_R1.fastq.gz string_L003_${i}_R1.fastq.gz string_L004_${i}_R1.fastq.gz string_L005_${i}_R1.fastq.gz \> ${i}_R1.fastq.gz; done

should give you the output below. When everything looks good, remove the word echo and the backslash before > to actually execute the commands, then repeat for the R2 files.

cat string_L001_sampID1_R1.fastq.gz string_L002_sampID1_R1.fastq.gz string_L003_sampID1_R1.fastq.gz string_L004_sampID1_R1.fastq.gz string_L005_sampID1_R1.fastq.gz > sampID1_R1.fastq.gz
cat string_L001_sampID2_R1.fastq.gz string_L002_sampID2_R1.fastq.gz string_L003_sampID2_R1.fastq.gz string_L004_sampID2_R1.fastq.gz string_L005_sampID2_R1.fastq.gz > sampID2_R1.fastq.gz
modified 12 months ago • written 12 months ago by genomax67k
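
For anyone who would rather not hard-code the lane names, the two steps above can be folded into one loop that lets the shell glob pick up however many lanes exist. This is only a sketch: it assumes the string_laneID_sampleID_pairID.fastq.gz naming from the question, pair IDs of R1 and R2, and IDs without underscores or spaces. The function name merge_lanes and the separate output directory are my own choices, not something from the thread.

```shell
#!/bin/bash
# Sketch: merge all lanes of each sample, separately for R1 and R2.
# Assumes names like string_<lane>_<sample>_<pair>.fastq.gz (hypothetical layout
# taken from the question) and that IDs contain no underscores or spaces.

merge_lanes() {
    local dir="$1" out="$2" f sample pair
    mkdir -p "$out"
    # Field 3 of the underscore-separated file name is the sample ID.
    for sample in $(for f in "$dir"/*_R1.fastq.gz; do basename "$f"; done |
                        awk -F '_' '{print $3}' | sort -u); do
        for pair in R1 R2; do
            # Globs expand in sorted order, so lanes are joined in the
            # same order for R1 and R2 and read pairing is preserved.
            cat "$dir"/*_"${sample}"_"${pair}".fastq.gz > "$out/${sample}_${pair}.fastq.gz"
        done
    done
}
```

Something like `merge_lanes . merged` would then write one R1 and one R2 file per sample into a merged/ folder, leaving the per-lane files untouched.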

Thank you so much to both of you! That makes total sense.

written 12 months ago by volvicpellegrino0

I have been using this method for a while. I have always been dubious of it, since cat-ing multiple .gz files like this seems sketchy, but it always seems to work.

written 12 months ago by steve2.0k
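
On why this works: the gzip format allows a file to consist of several compressed members back to back, and gzip decompresses them in sequence, so the result is the same as concatenating the uncompressed files. A quick way to convince yourself is the throwaway test below. The function name and the two tiny records are made up for illustration. One caveat worth checking is that the downstream tool reads all members, since a reader that stopped after the first one would only see the first lane.

```shell
#!/bin/bash
# Sketch: show that two gzip files joined with cat decompress to both inputs.
# The records below are invented test data, not real reads.

concat_gzip_demo() {
    local tmp
    tmp=$(mktemp -d)
    printf '@read1\nACGT\n+\nIIII\n' | gzip > "$tmp/lane1.fastq.gz"
    printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$tmp/lane2.fastq.gz"
    # Plain byte-level concatenation of the two compressed files.
    cat "$tmp/lane1.fastq.gz" "$tmp/lane2.fastq.gz" > "$tmp/merged.fastq.gz"
    # gzip -t checks integrity; gzip -dc prints the decompressed content.
    gzip -t "$tmp/merged.fastq.gz" && gzip -dc "$tmp/merged.fastq.gz"
    rm -rf "$tmp"
}
```

Calling `concat_gzip_demo` should print both records, read1 followed by read2.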
4
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:
find ./ -type f -name "*.fastq.gz" | while read -r F; do basename "$F" | cut -d _ -f 2,3; done | sort | uniq | while read -r P; do find ./ -type f -name "string_${P}_*.fastq.gz" -exec cat '{}' ';' > "${P}.merged.fq.gz"; done
written 12 months ago by Pierre Lindenbaum120k

Thanks very much for this - could you explain how it works or point me in the right direction to understand?

written 12 months ago by volvicpellegrino0

Thanks again, so breaking it down it looks like it lists all files in the directory, truncates the directory values, extracts the 2nd and 3rd fields delimited by '_', sorts them (alphanumerically?), filters matching occurrences, and then runs a concatenate command for every file matching each line of the text to output to a file containing the sampleID.merged... Is that right? What does the semicolon ; denote in the cat command?

written 12 months ago by volvicpellegrino0
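
On the semicolon: it marks the end of the command given to find's -exec, and it is quoted so that the shell passes it to find instead of treating it as a command separator. With ';' the command runs once per matching file. The > redirect belongs to the whole find call, so the output of every cat ends up in one file. The alternative terminator + passes as many files as possible to a single run. The two helper functions below are made-up names that just count how often the command is run in each case. Separately, if the files really are named string_laneID_sampleID_pairID.fastq.gz, fields 2 and 3 are the lane and the sample, so it looks as though the grouping key would need to be fields 3 and 4 (with .fastq.gz stripped) to get one file per sample and read.

```shell
#!/bin/bash
# Sketch: compare the two ways of terminating find -exec.
# Both functions take a directory and print how many times echo was run.

runs_with_semicolon() {
    # ';' ends the -exec command; echo is run once for every file found.
    find "$1" -type f -name "*.txt" -exec echo run '{}' ';' | wc -l
}

runs_with_plus() {
    # '+' also ends the -exec command, but batches the files into one run.
    find "$1" -type f -name "*.txt" -exec echo run '{}' + | wc -l
}
```

In a directory holding three .txt files, the first function should report 3 runs and the second 1.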
4
12 months ago by
Paul1.3k
European Union
Paul1.3k wrote:

If your FASTQ files always have the suffix *_L00*_R*_001.fastq.gz, which is standard Illumina output, this script should work. It concatenates the FASTQ files across all lanes, separately for R1 and R2:

#!/bin/bash

# Build the list of sample prefixes by stripping the 21-character
# suffix "_L00x_Rx_001.fastq.gz" from every file name.
for i in $(find ./ -type f -name "*.fastq.gz" | while read -r F; do basename "$F" | rev | cut -c 22- | rev; done | sort | uniq)
do
    echo "Merging R1"
    cat "$i"_L00*_R1_001.fastq.gz > "$i"_ME_L001_R1_001.fastq.gz

    echo "Merging R2"
    cat "$i"_L00*_R2_001.fastq.gz > "$i"_ME_L001_R2_001.fastq.gz
done

Save the script in a text editor as fastq_lane_merging.sh and make it executable:

chmod +x fastq_lane_merging.sh

Then run it in the folder containing your FASTQ files: ./fastq_lane_merging.sh

If you want to understand what the script does, I would recommend running parts of the one-liner in your terminal to see what happens.

Note: It would be best to add a check to this script that you have the same number of FASTQ files for each lane and for both reads.
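
One possible shape for such a check is sketched below. It counts the R1 and R2 files per sample prefix and complains when the two differ or, if an expected lane count is given, when either differs from it. The function names check_lane_counts and count_existing are my own, and the sketch assumes the *_L00*_R1_001.fastq.gz naming described above, with all files in one folder and no spaces in file names.

```shell
#!/bin/bash
# Sketch of a pre-merge check: same number of lane files for R1 and R2.
# Assumes Illumina-style names <prefix>_L00x_R[12]_001.fastq.gz in one folder.

count_existing() {
    # Count how many of the given paths exist (an unmatched glob counts as 0).
    local n=0 f
    for f in "$@"; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

check_lane_counts() {
    local dir="$1" expected="${2:-}" status=0 f name prefix n1 n2
    for prefix in $(for f in "$dir"/*_L00*_R[12]_001.fastq.gz; do
                        [ -e "$f" ] || continue
                        name=$(basename "$f")
                        # Strip the "_L00x_Rx_001.fastq.gz" suffix.
                        echo "${name%_L00*_R[12]_001.fastq.gz}"
                    done | sort -u); do
        n1=$(count_existing "$dir/${prefix}"_L00*_R1_001.fastq.gz)
        n2=$(count_existing "$dir/${prefix}"_L00*_R2_001.fastq.gz)
        if [ "$n1" -ne "$n2" ] || { [ -n "$expected" ] && [ "$n1" -ne "$expected" ]; }; then
            echo "MISMATCH $prefix: R1=$n1 R2=$n2" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_lane_counts . 5 || exit 1` before the merge loop would stop the script as soon as a sample does not have 5 files for each read.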

modified 5 months ago • written 12 months ago by Paul1.3k

Hi,

This has been a really useful script! Just wanted to know if there is a way to change it so that I can run when my fastq files are in different sub-directories? Currently it only works if they are in the same one.

Cheers!

written 4 weeks ago by unawaz40
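
For files spread over sub-directories, one option is to let find locate the lane files for each prefix instead of relying on a glob in the current folder. The sketch below is untested on real data and makes several assumptions: the same *_L00*_R1_001.fastq.gz naming as the script above, no spaces in paths, and a function name (merge_lanes_recursive) and output folder that I made up. Files are ordered by file name rather than by path, so the lanes should be joined in the same order for R1 and R2 even when each lane sits in its own folder.

```shell
#!/bin/bash
# Sketch: merge lanes per sample when the FASTQ files sit in sub-directories.
# Assumes Illumina-style names <prefix>_L00x_R[12]_001.fastq.gz and no spaces in paths.

merge_lanes_recursive() {
    local root="$1" out="$2" F name prefix mate
    mkdir -p "$out"
    # Collect sample prefixes from every lane file, skipping earlier merge output.
    find "$root" -type f -name "*_L00*_R[12]_001.fastq.gz" ! -name "*_ME_L001_*" |
        while read -r F; do
            name=${F##*/}                             # drop the directory part
            echo "${name%_L00*_R[12]_001.fastq.gz}"   # drop the lane/read suffix
        done | sort -u |
        while read -r prefix; do
            for mate in R1 R2; do
                # Sort on the file name (not the path) so lane order is stable.
                find "$root" -type f -name "${prefix}_L00*_${mate}_001.fastq.gz" |
                    awk -F/ '{print $NF "\t" $0}' | sort | cut -f 2- |
                    xargs cat > "$out/${prefix}_ME_L001_${mate}_001.fastq.gz"
            done
        done
}
```

It could be called as `merge_lanes_recursive /path/to/run merged`, where both paths are placeholders for your own folders.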
0
12 months ago by
swbarnes25.6k
United States
swbarnes25.6k wrote:

Depending on what you are using to align, you might be able to cat the samples on the fly, like

cat *.sample1*.fastq.gz | STAR ...

So if you already have a loop that is handling the variable for the sample name, just use that to build your command.

written 12 months ago by swbarnes25.6k
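
Along the same lines for Kallisto: as far as I know, kallisto quant accepts several pairs of FASTQ files in one call (R1 then R2 for each lane), which would avoid writing merged files at all. The sketch below only prints the command so it can be inspected first. The function name kallisto_cmd, the index and output names are placeholders, and it assumes the string_laneID_sampleID_pairID.fastq.gz naming from the question with pair IDs R1 and R2 and no spaces in paths.

```shell
#!/bin/bash
# Sketch: print a kallisto quant command listing every lane pair of one sample.
# Index and output paths are placeholders; nothing is executed, only echoed.

kallisto_cmd() {
    local dir="$1" sample="$2" index="$3" out="$4" r1 r2 args=""
    for r1 in "$dir"/*_"${sample}"_R1.fastq.gz; do
        # Derive the matching R2 file name from the R1 file name.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        args="$args $r1 $r2"
    done
    echo "kallisto quant -i $index -o $out/$sample$args"
}
```

For example, `kallisto_cmd . sampID1 transcripts.idx results` prints the command for sampID1; once it looks right, the printed line can be run as is.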