I have FASTQ files for samples from a first and a second sequencing run, and they are kept in different directories as shown below:
First Run:
Data1
|_____fastq_folder
      |_______ sample1
               |______ sample1_L1_R1.fastq.gz
               |______ sample1_L1_R2.fastq.gz
               |______ sample1_L2_R1.fastq.gz
               |______ sample1_L2_R2.fastq.gz
      |_______ sample2
               |______ sample2_L1_R1.fastq.gz
               |______ sample2_L1_R2.fastq.gz
               |______ sample2_L2_R1.fastq.gz
               |______ sample2_L2_R2.fastq.gz
Second Run:
Data2
|_____fastq_folder
      |_______ sample1
               |______ sample1_L1_R1.fastq.gz
               |______ sample1_L1_R2.fastq.gz
      |_______ sample2
               |______ sample2_L1_R1.fastq.gz
               |______ sample2_L1_R2.fastq.gz
Usually, when I want to run Salmon or Kallisto on the First Run files, which are in the directory Data1, I do it as follows.
Let's say I'm inside the directory Data1, where I have a script named kallisto.sh. Inside the script, I read the fastq files like this:
r1=$(ls $fastq_folder/$sample/$sample*_R1.fastq.gz)
r2=$(ls $fastq_folder/$sample/$sample*_R2.fastq.gz)
But now I would also like to use the Second Run files in my script. How should I change r1 and r2 so that they read all the files from both the First Run and the Second Run?
P.S.: I know there is a way to merge the files first and then perform the analysis, but merging might take a very long time at my workplace.
In any case you need to merge the data from Data1: the files were run on different lanes but represent the same biological sample.
So something like:
cat sample1_L1_R1.fastq.gz sample1_L2_R1.fastq.gz > sample1_R1.fastq.gz
(== join the data from different lanes into one file per biological sample)
Yes, I know this. Please check the last line of my post. Merging might take a very long time at my workplace, so I'm looking for an alternative way.
You can use the find command with a certain depth, like here: How to concatenate multiple fastq files (located in different directories) for each sample
Is the sample1 naming consistent across folders and files?
Why would this take a huge amount of time? It will take up space, since you will duplicate the data for some time.
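The find-based merge suggested above could look roughly like the sketch below. The function name merge_reads and the output path are made up for illustration, and the search depth of 3 assumes the Data1/fastq_folder/sample1/ layout shown in the question. Sorting the paths keeps the R1 and R2 files in the same order, which matters for paired reads; concatenated gzip files are themselves a valid gzip stream.

```shell
#!/usr/bin/env bash
# Sketch: concatenate all lane/run files of one sample and one mate (R1 or R2).
# Assumed layout: <run_dir>/fastq_folder/<sample>/<sample>_L*_R{1,2}.fastq.gz
# merge_reads is a hypothetical helper name, not part of any tool.
merge_reads() {
    local sample=$1 mate=$2 out=$3
    shift 3                       # remaining arguments are the run directories
    # -maxdepth 3 reaches <run_dir>/fastq_folder/<sample>/<file> and no deeper.
    # sort -z gives a stable order, so R1 and R2 are merged consistently.
    find "$@" -maxdepth 3 -type f -name "${sample}_*_${mate}.fastq.gz" -print0 \
        | sort -z \
        | xargs -0 cat > "$out"
}

# Hypothetical usage, run from the directory that contains Data1 and Data2:
# merge_reads sample1 R1 sample1_R1.fastq.gz Data1 Data2
# merge_reads sample1 R2 sample1_R2.fastq.gz Data1 Data2
```

This still duplicates the data on disk while the merged files exist, which is the cost mentioned in the comments.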
Not sure if this is what you're asking, but if the runs represent the same biological sample, you can just put them one right after another in kallisto:
kallisto quant -i index.idx -o output/ run1.r1.fq.gz run1.r2.fq.gz run2.r1.fq.gz run2.r2.fq.gz run3.r1.fq.gz run3.r2.fq.gz
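To avoid typing the pairs by hand, the list can be built from both run directories without merging anything. The following is a minimal sketch, assuming the directory layout from the question and that every R1 file has an R2 file with the same name apart from the R1/R2 part; collect_pairs is a hypothetical helper name, and index.idx and the output directory are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print the R1/R2 paths of one sample, interleaved, across run directories.
# Assumed layout: <run_dir>/fastq_folder/<sample>/<sample>_L*_R{1,2}.fastq.gz
collect_pairs() {
    local sample=$1
    shift                         # remaining arguments are the run directories
    local run_dir r1 r2
    for run_dir in "$@"; do
        for r1 in "$run_dir/fastq_folder/$sample/${sample}"_*_R1.fastq.gz; do
            [ -e "$r1" ] || continue               # the glob matched nothing
            r2=${r1%_R1.fastq.gz}_R2.fastq.gz      # derive the mate's file name
            if [ ! -e "$r2" ]; then
                echo "missing mate for $r1" >&2
                return 1
            fi
            printf '%s\n' "$r1" "$r2"              # one pair, R1 then R2
        done
    done
}

# Hypothetical usage, run from the directory that contains Data1 and Data2
# (mapfile needs bash 4 or newer):
# mapfile -t reads < <(collect_pairs sample1 Data1 Data2)
# kallisto quant -i index.idx -o output/sample1 "${reads[@]}"
```

The pattern `${sample}_*` is slightly stricter than `$sample*` in the original script, so that sample1 does not also pick up files belonging to a sample named, say, sample10.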