Help understanding "sed" command in a loop
2
1
Entering edit mode
2.4 years ago
valentinavan ▴ 50

Hi everyone,

I have 2 questions:

1) I have found this script online to run Kraken2 in a loop on paired ends. Although I know it works well, because I have compared the results with another loop I have, I am not really understanding what is doing.

for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//'); do kraken2 --db path/plant_db --memory-mapping --threads 8 --use-mpa-style --confidence 0.1 --report path/${FILE}_report.txt --paired ${FILE}_R1.fastq ${FILE}_R2.fastq --report-zero-counts --output path/${FILE}_taxa.txt; done

Can someone please explain to me how to read 's/_R1.fastq//' ? How does it know to use the _R2.fastq files too?

2) I have also written this script using GNU parallel which works well for single paired ends but I am not sure how to modify it if I want to use paired ends. Can someone help me out, please?

time parallel -j2 "kraken2 {} --threads 2 --db path/plant_db --gzip-compressed --confidence 0.1 --report {}report.txt --report-zero-counts --output {}taxa.txt" ::: path/*.fastq

Thanks a lot!

sed loop • 1.9k views
ADD COMMENT
3
Entering edit mode
2.4 years ago

Regarding first point:

 1. for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//') -- list all the files endings with `_R1.fastq` (ls *_R1.fastq) and remove the end string `_R1.fastq`  for all the listed files
 2. within loop, `${FILE}_R1.fastq ${FILE}_R2.fastq`  -- reconstruct R1 and R2 fastq.

Regarding second point:

$ parallel --dry-run 'kraken2 --db path/plant_db --memory-mapping --threads 8 --use-mpa-style --confidence 0.1 --report path/{=s/_R1.fastq//=}_report.txt --paired {} {=s/_R1.fastq//=}_R2.fastq --report-zero-counts --output path/{=s/_R1.fastq//=}_taxa.txt' ::: *_R1.fastq

Execute in the same directory where R1 and R2 are located. Parallel is a dry-run. Remove dry-run when you are okay with output commands. When you do not understand a loop, please use echo. While using parallel, you let either parallel handle threads (-j2) or program let handle threads (--threads 2).

ADD COMMENT
1
Entering edit mode

Shorter: parallel --dry-run --rpl '{R} s/_R1.fastq//' 'kraken2 --db path/plant_db --memory-mapping --threads 8 --use-mpa-style --confidence 0.1 --report path/{R}_report.txt --paired {} {R}_R2.fastq --report-zero-counts --output path/{R}_taxa.txt' ::: *_R1.fastq

ADD REPLY
2
Entering edit mode
2.4 years ago
lethalfang ▴ 140

The sed argument 's/_R1.fastq//' has 3 parts. Each is separated by /.

  • s means substitution
  • _R1.fastq is something to be substituted
  • The next is actually empty, i.e., substitute _R1.fastq with nothing.

So when ls generate a list of files with _R1.fastq at the end, sed will remove the _R1.fastq in each file. FILE becomes the prefixes of *_R1.fastq.

You can try to see what the FILE will be by echo $(ls *_R1.fastq | sed 's/_R1.fastq//').

ADD COMMENT
0
Entering edit mode

Yep cool I understood :) so I could have also used: $(ls *_R2.fastq | sed 's/_R2.fastq//') and the results would not have changed!

Thank you all! :)

ADD REPLY
0
Entering edit mode

A tip, if you need to replace paths, you may use | instead of / as the separator, i.e., echo /PATH/TO/OLD/DIR/ | sed 's|OLD/DIR|NEW/DIRECTORY|g'

The g stands for global, which means to apply the substitution to every instance on a line. Otherwise it'll just replace the first instance.

ADD REPLY

Login before adding your answer.

Traffic: 1889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6