Question

Help understanding "sed" command in a loop

2.4 years ago

valentinavan ▴ 50

Hi everyone,

I have 2 questions:

1) I found this script online to run Kraken2 in a loop on paired-end reads. I know it works, because I compared the results with another loop I have, but I do not really understand what it is doing.

for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//'); do kraken2 --db path/plant_db --memory-mapping --threads 8 --use-mpa-style --confidence 0.1 --report path/${FILE}_report.txt --paired ${FILE}_R1.fastq ${FILE}_R2.fastq --report-zero-counts --output path/${FILE}_taxa.txt; done

Can someone please explain to me how to read 's/_R1.fastq//' ? How does it know to use the _R2.fastq files too?

2) I have also written this script using GNU parallel, which works well for single-end reads, but I am not sure how to modify it for paired-end reads. Can someone help me out, please?

time parallel -j2 "kraken2 {} --threads 2 --db path/plant_db --gzip-compressed --confidence 0.1 --report {}report.txt --report-zero-counts --output {}taxa.txt" ::: path/*.fastq

Thanks a lot!

sed loop • 1.9k views

Updated 2.4 years ago by ole.tange ★ 4.4k • written 2.4 years ago by valentinavan ▴ 50

score 3 · Answer 1 · 2021-11-10

Regarding first point:

 1. `for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//')` -- list all files ending with `_R1.fastq` (`ls *_R1.fastq`) and strip the trailing `_R1.fastq` from each name, so `FILE` holds only the sample prefix.
 2. Within the loop, `${FILE}_R1.fastq ${FILE}_R2.fastq` -- rebuild the R1 and R2 file names from that prefix. This is how the R2 files get picked up.
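To watch both steps happen without running Kraken2, here is a minimal sketch in a scratch directory. The sample names `sampleA` and `sampleB` are made up for illustration, and the files are empty placeholders.

```shell
# Scratch directory with hypothetical paired files (names are invented)
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch sampleA_R1.fastq sampleA_R2.fastq sampleB_R1.fastq sampleB_R2.fastq

# Step 1: list the R1 files and strip the suffix, leaving one prefix per line
prefixes=$(ls *_R1.fastq | sed 's/_R1.fastq//')
echo "$prefixes"

# Step 2: rebuild both mate file names from each prefix
pairs=$(for FILE in $prefixes; do echo "${FILE}_R1.fastq ${FILE}_R2.fastq"; done)
echo "$pairs"
```

Each line of the second output is the pair of file names that one loop iteration would pass to `--paired`.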

Regarding second point:

$ parallel --dry-run 'kraken2 --db path/plant_db --memory-mapping --threads 8 --use-mpa-style --confidence 0.1 --report path/{=s/_R1.fastq//=}_report.txt --paired {} {=s/_R1.fastq//=}_R2.fastq --report-zero-counts --output path/{=s/_R1.fastq//=}_taxa.txt' ::: *_R1.fastq

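Run this in the directory that contains the R1 and R2 files. The command above is a dry run: parallel only prints the commands it would execute. Remove `--dry-run` once you are happy with the printed commands. When you do not understand a loop, put echo in front of the command to see what it would run. With parallel, let either parallel handle the parallelism (-j2) or let the program handle threads (--threads 2); combining both multiplies the number of threads in use.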
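The echo advice applied to the loop from the question could look like the sketch below. It assumes a scratch directory with one made-up sample (`s1`), and the kraken2 option list is shortened for readability; nothing is actually executed except echo.

```shell
# Hypothetical scratch setup: one invented sample with empty placeholder files
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch s1_R1.fastq s1_R2.fastq

# Prefixing the command with echo prints it instead of running it
preview=$(for FILE in $(ls *_R1.fastq | sed 's/_R1.fastq//'); do
  echo kraken2 --db path/plant_db --report "path/${FILE}_report.txt" \
       --paired "${FILE}_R1.fastq" "${FILE}_R2.fastq" --output "path/${FILE}_taxa.txt"
done)
echo "$preview"
```

Once the printed commands look right, dropping the echo runs them for real.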

score 2 · Answer 2 · 2021-11-11

2.4 years ago

lethalfang ▴ 140

The sed argument 's/_R1.fastq//' has 3 parts, separated by /.

s means substitute
_R1.fastq is the pattern to be replaced
The replacement, between the last two slashes, is empty, i.e., _R1.fastq is replaced with nothing.

So when ls generates a list of files with _R1.fastq at the end, sed removes the _R1.fastq from each file name. FILE then takes each prefix of *_R1.fastq in turn.

You can check what FILE will be with echo $(ls *_R1.fastq | sed 's/_R1.fastq//').
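A small sketch of the substitution on its own, using invented file names. It also shows a side effect worth knowing: the unescaped dot in the pattern is a regular-expression wildcard that matches any single character. The stricter pattern with `\.` and `$` is a suggested variant, not something from the original script.

```shell
# s/PATTERN/REPLACEMENT/ with an empty replacement deletes the first match on each line
name="sampleA_R1.fastq"   # hypothetical file name
prefix=$(printf '%s\n' "$name" | sed 's/_R1.fastq//')
echo "$prefix"

# The unescaped dot matches any character, so "_R1xfastq" is removed as well
loose=$(printf '%s\n' "sampleA_R1xfastq" | sed 's/_R1.fastq//')
echo "$loose"

# Stricter variant: escape the dot and anchor the match to the end of the line
strict=$(printf '%s\n' "sampleA_R1xfastq" | sed 's/_R1\.fastq$//')
echo "$strict"
```

For file names that all end in `_R1.fastq`, both patterns give the same prefixes; the stricter one only matters for unusual names.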

Yep cool I understood :) so I could have also used: $(ls *_R2.fastq | sed 's/_R2.fastq//') and the results would not have changed!

Thank you all! :)

Reply • 2.4 years ago by valentinavan ▴ 50

A trailing g, as in 's/_R1.fastq//g', stands for global, which means the substitution is applied to every match on a line. Without it, sed replaces only the first match on each line.
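The difference is easy to see on a made-up string that contains the pattern twice:

```shell
line="a_R1_b_R1"   # invented string with two occurrences of _R1

# Without g: only the first match on the line is removed
first_only=$(printf '%s\n' "$line" | sed 's/_R1//')
echo "$first_only"   # a_b_R1

# With g: every match on the line is removed
all=$(printf '%s\n' "$line" | sed 's/_R1//g')
echo "$all"          # a_b
```

For file names that contain `_R1.fastq` only once, the g flag makes no difference.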

Reply • 2.4 years ago by lethalfang ▴ 140