Question: How can I write a loop to run the kallisto bus command on many fastq.gz files at once?
29 days ago, F. Golestan wrote:

Hello. I am using the kallisto | bustools workflow for single-cell RNA-seq analysis. In their tutorial (https://www.kallistobus.tools/multiple_files_tutorial.html), they run the kallisto bus command on four pairs of fastq files as follows:

$ kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -o bus_output/ -x 10xv2 -t 4 \
bamtofastq_S1_L001_R1_001.fastq.gz \
bamtofastq_S1_L001_R2_001.fastq.gz \
bamtofastq_S1_L002_R1_001.fastq.gz \
bamtofastq_S1_L002_R2_001.fastq.gz \
bamtofastq_S1_L003_R1_001.fastq.gz \
bamtofastq_S1_L003_R2_001.fastq.gz \
bamtofastq_S1_L004_R1_001.fastq.gz \
bamtofastq_S1_L004_R2_001.fastq.gz

However, as I have many fastq.gz files (around 800), it is impractical to list all 800 files on the kallisto bus command line as above. I need to run all of them at once using a loop. So I took a kallisto quant loop from an RNA-seq course and adapted it for the kallisto bus command (for the same paired samples mentioned above), as below. However, it did not produce any output.

## Run kallisto:
# for four paired samples (-n 8):
find /scratch/fs/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while read FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 kallisto bus -i /scratch/fs/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o . -t 16 "$FW_READ" "$RV_READ"
 # Kallisto doesn't let us specify an output filename so we rename all output files
 mv "matrix.ec" $FILEBASE-"matrix.ec"
 mv "output.bus" $FILEBASE-"output.bus"
 mv "run_info.json" $FILEBASE-"run_info.json"
 mv "transcripts.txt" $FILEBASE-"transcripts.txt"
done
#

Could you please help me fix this chunk of code so that it runs the kallisto bus command in a loop for all of the samples at once? Thank you so much.

29 days ago, genomax (United States) wrote:

Have you tried to include echo (so the commands do not actually run but are printed to screen so you can check the output) in your script at strategic locations to see where there may be an error? You can test it with a small number of files first.
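As a minimal sketch of this dry-run idea: put echo only in front of the command you want to preview, and leave read and the variable assignments to run for real. The two file names and `index.idx` below are made-up placeholders standing in for the output of find | sort and for a real index.

```shell
# Two made-up file names stand in for the sorted output of find.
preview=$(
    printf '%s\n' sample_R1_001.fastq.gz sample_R2_001.fastq.gz |
    while read -r FW_READ
    do
        read -r RV_READ                        # runs for real: fills RV_READ
        FILEBASE=${FW_READ%_R1_001.fastq.gz}   # runs for real: plain assignment
        # Only the expensive command is prefixed with echo, so it is printed, not run.
        echo kallisto bus -i index.idx -x 10xv2 -o "$FILEBASE" "$FW_READ" "$RV_READ"
    done
)
echo "$preview"
```

Once the printed command looks right, removing that single echo makes the loop run it.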


Thank you for your guidance. I am not sure whether I have used echo correctly or in the right places, but after adding echo I tried the following:

echo find /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while echo read FW_READ
do
 echo read RV_READ
 echo FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 echo kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o . -t 16 "$FW_READ" "$RV_READ"
 # Kallisto doesn't let us specify an output filename so we rename all output files
 echo mv "matrix.ec" $FILEBASE-"matrix.ec"
 echo mv "output.bus" $FILEBASE-"output.bus"
 echo mv "run_info.json" $FILEBASE-"run_info.json"
 echo mv "transcripts.txt" $FILEBASE-"transcripts.txt"
done

But it produced this error: read -bash: FW_READ: command not found

Reply written 29 days ago by F. Golestan

Skip the echo before the find command; otherwise the loop won't read in the FW_READ and RV_READ variables. Just echo the variable names and the kallisto command. This will show you the commands that would be run.

find /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs -name "*_[R1R2]_001.fastq.gz" | sort | head -n 8 | while read FW_READ
do
 read RV_READ
 FILEBASE=$(basename "${FW_READ%_R1_001.fastq.gz}")
 echo FILEBASE
 echo kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 \
 -o $FILEBASE -t 16 "$FW_READ" "$RV_READ"
done

You can also use folders to place the output for each file.

Reply written 29 days ago (modified 29 days ago) by pmelsted

Thanks a lot. I ran it exactly as you suggested; however, it did not produce anything, not even an error. It just returned to the Linux command prompt ($).

Reply written 29 days ago by F. Golestan

That indicates that the find command is not finding anything. Check the path and the filename pattern; "*_001.fastq.gz" might work better.
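A likely reason the original pattern finds nothing: in a find -name pattern, [R1R2] is a bracket expression that matches a single character (R, 1 or 2), not the strings R1 or R2. The sketch below illustrates this in a throwaway directory with two empty files named like those in the tutorial; the directory and variable names are only for illustration.

```shell
# Throwaway directory with two empty files named like the tutorial's.
demo_dir=$(mktemp -d)
touch "$demo_dir"/bamtofastq_S1_L001_R1_001.fastq.gz \
      "$demo_dir"/bamtofastq_S1_L001_R2_001.fastq.gz

# [R1R2] matches ONE character (R, 1 or 2), so this pattern would only
# match a name ending in something like "_R_001.fastq.gz".
n_bad=$(find "$demo_dir" -name "*_[R1R2]_001.fastq.gz" | wc -l)

# R[12] matches a literal R followed by one character, 1 or 2.
n_good=$(find "$demo_dir" -name "*_R[12]_001.fastq.gz" | wc -l)

echo "matches with [R1R2]: $n_bad"
echo "matches with R[12]: $n_good"
rm -r "$demo_dir"
```

The broader "*_001.fastq.gz" pattern also works here, as long as the directory holds no other files with that suffix.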

Reply written 29 days ago by pmelsted

Thanks. The result of running the code with "*_001.fastq.gz" (instead of "*_[R1R2]_001.fastq.gz") is as below:

FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L001_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L001_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L002_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L002_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L003_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L003_R2_001.fastq.gz
FILEBASE
kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -o . -t 16 /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L004_R1_001.fastq.gz /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/bamtofastq_S1_L004_R2_001.fastq.gz

Now it shows the correct fastq.gz files, but it did not run kallisto bus.
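This is expected while echo is in front of the command: echo only prints the line, and kallisto bus runs once the echo is removed. Below is a minimal sketch of a loop that does one run per R1/R2 pair and writes each result to its own directory. The function name `bus_per_pair`, the `DRY_RUN` switch and the `_bus_output` suffix are hypothetical choices, and the sketch assumes the file naming used in this thread; only the -i, -x, -t and -o options already shown above are used.

```shell
# bus_per_pair FASTQ_DIR INDEX: one kallisto bus run per R1/R2 pair.
# DRY_RUN=1 (the default here) prints the commands; DRY_RUN=0 executes them.
# The body runs in a subshell ( ... ) so its variables stay local.
bus_per_pair() (
    fastq_dir=$1; index=$2
    find "$fastq_dir" -name "*_R1_001.fastq.gz" | sort |
    while IFS= read -r fw_read
    do
        # Derive the mate's name from the R1 name instead of reading a second line.
        rv_read=${fw_read%_R1_001.fastq.gz}_R2_001.fastq.gz
        filebase=$(basename "${fw_read%_R1_001.fastq.gz}")
        if [ ! -f "$rv_read" ]; then
            echo "skipping $filebase: no R2 file found" >&2
            continue
        fi
        # One output directory per pair, so results are not overwritten.
        set -- kallisto bus -i "$index" -x 10xv2 -t 16 \
            -o "${filebase}_bus_output" "$fw_read" "$rv_read"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$@"
        else
            "$@" < /dev/null   # defensive: keep the command off the loop's stdin
        fi
    done
)
```

A call such as `bus_per_pair fastqs Mus_musculus.GRCm38.cdna.all.idx` previews the commands, and prefixing it with `DRY_RUN=0` runs them. Whether one run per pair is appropriate depends on the data: if the pairs are lanes of the same sample, as in the tutorial, they belong in a single kallisto bus call instead.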

Reply written 29 days ago by F. Golestan

You can also make two directories, one for R1 and one for R2:

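# Split the mates into separate directories
mkdir R1
mkdir R2
mv *R1_001.fastq.gz R1
mv *R2_001.fastq.gz R2
# Prefix each file with its position in the sorted listing, so mates share a number
cd R1
ls | cat -n | while read n f; do mv "$f" "${n}_$f"; done
cd ../R2
ls | cat -n | while read n f; do mv "$f" "${n}_$f"; done
cd ..
# Run kallisto bus once per numbered pair (set 800 to the actual number of pairs)
for i in $(seq 1 800); do kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -o "${i}_bus_output/" -x 10xv2 -t 4 R1/"${i}"_*.gz R2/"${i}"_*.gz; done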

Hoping it will help you.

Reply written 28 days ago (modified 28 days ago) by archana.bioinfo87

Thank you very much for your help.

I also found another way to run kallisto bus on many fastq files at once:

cd /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/fastqs/

kallisto bus -i /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/Mus_musculus.GRCm38.cdna.all.idx -o /scratch/fsgh1d18/kallisto_bustools_multiple_lanes/bus_output/ -x 10xv2 -t 4 $(find . -name "*.fastq.gz" | sort)
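The command above relies on sort placing every R1 file directly before its R2 mate, and on the paths containing no spaces. A slightly more defensive sketch of the same idea builds the argument list pair by pair and stops if a mate is missing. The function name `bus_all_lanes` and its three arguments are hypothetical, and the leading echo means it only prints the command.

```shell
# bus_all_lanes FASTQ_DIR INDEX OUT_DIR: one kallisto bus run over every pair.
# The body runs in a subshell ( ... ) so its variables stay local.
bus_all_lanes() (
    fastq_dir=$1; index=$2; out_dir=$3
    r1_list=$(find "$fastq_dir" -name "*_R1_001.fastq.gz" | sort)
    if [ -z "$r1_list" ]; then
        echo "no *_R1_001.fastq.gz files under $fastq_dir" >&2
        exit 1
    fi
    set --    # rebuild the positional parameters as the ordered file list
    while IFS= read -r r1
    do
        r2=${r1%_R1_001.fastq.gz}_R2_001.fastq.gz
        if [ ! -f "$r2" ]; then
            echo "missing R2 file for $r1" >&2
            exit 1
        fi
        set -- "$@" "$r1" "$r2"    # each R1 directly followed by its R2
    done <<EOF
$r1_list
EOF
    # Drop the leading echo once the printed command looks right.
    echo kallisto bus -i "$index" -o "$out_dir" -x 10xv2 -t 4 "$@"
)
```

For example, `bus_all_lanes fastqs Mus_musculus.GRCm38.cdna.all.idx bus_output` prints a single command listing all pairs in order.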

Thank you.

Reply written 27 days ago by F. Golestan