How to make a loop to run kallisto bus command for so many fastq.gz files at once?
1
0
Entering edit mode
5.1 years ago
Farah ▴ 80

Hello

I am using kallisto | bustools workflow for single-cell RNA-seq analysis. In their workflow (https://www.kallistobus.tools/multiple_files_tutorial.html), they executed the below code to run kallisto bus command for four pair of fastq files as below:

$ kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -o bus_output/ -x 10xv2 -t 4 \
bamtofastq_S1_L001_R1_001.fastq.gz \
bamtofastq_S1_L001_R2_001.fastq.gz \
bamtofastq_S1_L002_R1_001.fastq.gz \
bamtofastq_S1_L002_R2_001.fastq.gz \
bamtofastq_S1_L003_R1_001.fastq.gz \
bamtofastq_S1_L003_R2_001.fastq.gz \
bamtofastq_S1_L004_R1_001.fastq.gz \
bamtofastq_S1_L004_R2_001.fastq.gz

However, as I have many fastq.gz files (around 800), it is tough to give all 800 files to kallisto bus command like the above approach. I really need to run all of them at once using a loop. So, I used a kallisto quant loop from a RNA-seq course to adapt it for kallisto bus command (for the same abovementioned paired samples) as below. However, It did not produce any output.

## Run kallisto:
# for four paired samples (-n 8):
find /scratch/fs/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while read 
FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 kallisto bus -i /scratch/fs/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o . -t 16 "$FW_READ" "$RV_READ"
 # Kallisto doesn't let us specify an output filename so we rename all output files
 mv "matrix.ec" $FILEBASE-"matrix.ec"
 mv "output.bus" $FILEBASE-"output.bus"
 mv "run_info.json" $FILEBASE-"run_info.json"
 mv "transcripts.txt" $FILEBASE-"transcripts.txt"
done
#

May I kindly ask you to help me how to fix this chunk of code to run a loop for kallisto bus command for all of samples at once? Thank you so much.

bash kallisto bustools • 4.2k views
ADD COMMENT
0
Entering edit mode
5.1 years ago
GenoMax 147k

Have you tried to include echo (so the commands do not actually run but are printed to screen so you can check the output) in your script at strategic locations to see where there may be an error? You can test it with a small number of files first.

ADD COMMENT
0
Entering edit mode

Thank you for your guide. I am not sure whether I have included echo in a correct way or not (also in good locations or not), but, I tried as below after adding echo:

echo find /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while echo read FW_READ
do
 echo read RV_READ
 echo FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 echo kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o . -t 16 "$FW_READ" "$RV_READ"
 # Kallisto doesn't let us specify an output filename so we rename all output files
 echo mv "matrix.ec" $FILEBASE-"matrix.ec"
 echo mv "output.bus" $FILEBASE-"output.bus"
 echo mv "run_info.json" $FILEBASE-"run_info.json"
 echo mv "transcripts.txt" $FILEBASE-"transcripts.txt"
done

But, It produced this Error: read -bash: FW_READ: command not found

ADD REPLY
1
Entering edit mode

skip the echo before the find command, otherwise it won't read in the FV and RV_READ variables. Just echo the variable names and kallisto command. This should tell you the commands that will be run.

find /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while read FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 echo FILEBASE
 echo kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o $FILEBASE -t 16 "$FW_READ" "$RV_READ"
done

You can also use folders to place the output for each file.

ADD REPLY
0
Entering edit mode

Thanks a lot. I ran exactly as you suggested, however, it did not produce anything, even no error. It just resulted in linux command prompt sign ($).

ADD REPLY
1
Entering edit mode

That indicates that the find command is not finding anything. Check the path and filename regex, maybe "*_001.fastq.gz" might work bettere

ADD REPLY
0
Entering edit mode

Thanks. The result of running the code with "_001.fastq.gz" (instead of "_[R1R2]_001.fastq.gz") is as below:

FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L001_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L001_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L002_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L002_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L003_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L003_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L004_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L004_R2_001.fastq.gz

Now, It is showing the correct fastq.gz files, but it did not do kallisto bus.

ADD REPLY
0
Entering edit mode

You can also make two directory for R1 and R2

mkdir R1
mkdir R2
mv *R1_001.fastq.gz R1
mv *R2_001.fastq.gz R2
cd R1
ls | cat -n | while read n f; do mv "$f" "$n"_$f; done
cd ../R2
ls | cat -n | while read n f; do mv "$f" "$n"_$f; done
cd ..
for i in $(seq 1 800); do kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -o $i\_bus_output/ -x 10xv2 -t 4 R1/$i\_*.gz  R2/$i\_.gz ; done

Hoping it will help you.

ADD REPLY
0
Entering edit mode

Thank you very much for your help.

I also found another way of doing kallisto bus for many fastq files at once as below:

cd /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/

kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -o /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/bus_output/ -x 10xv2 -t 4 $(find . -name "*.fastq.gz" | sort)

Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 814 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6