Using comm to make a list of files that haven't yet been processed
2.4 years ago

I'm using comm to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information, so they can be compared.

DIR=/Users/michaelflower/Desktop/testing_todo2
TODO=$(comm -3 <(basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a "$DIR"/results/repeats/*output.txt | sed 's/_repeats_output.*//'))

To make a list of files to be processed by my program I then put the file path and suffix information back in:

for i in $TODO; do echo "$DIR"/${i}"_R1_001.fastq.gz"; done

This is working great, except when the output directory is empty. When there's at least 1 file in the output directory I get the perfect output. But when the output directory is empty (or doesn't yet exist), the first entry in the list is *output.txt. This is messing up the script I'm using these file names in. Any idea how to remove that first entry?

/Users/michaelflower/Desktop/testing_todo2/*output.txt_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF1-JL125CAG-NPC-20210703_S5_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF2-JL125CAG-NPC-20210510_S3_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF3-130CAGiPSC-BL-20210521_S2_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF4-JL180CAG-NPC1-20211211_S4_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz

Using comm to make a list of files that haven't yet been processed

Use a workflow manager such as make, Snakemake, or Nextflow.


This is working great,

This is an incorrect use of comm: both inputs MUST be sorted.
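On the stray *output.txt entry: when a glob matches nothing, bash leaves the pattern as literal text, so basename prints *output.txt. Because comm -3 prints lines unique to either input, that name ends up in the list. Below is a minimal sketch of one way to handle both points, assuming bash (it relies on shopt and process substitution). The helper names list_todo and strip, and the sample names in the demo, are made up for illustration and are not part of the original script.

```bash
#!/usr/bin/env bash
# Sketch only: list_todo and strip are hypothetical helpers.
# Prints the full path of every *_R1_001.fastq.gz in DIR whose sample name
# has no matching *output.txt in DIR/results/repeats.
list_todo() (
    # The function body is a subshell, so these settings do not leak out.
    dir=$1
    export LC_ALL=C            # sort and comm must agree on collation order
    shopt -s nullglob          # unmatched globs expand to nothing

    inputs=("$dir"/*_R1_001.fastq.gz)
    outputs=("$dir"/results/repeats/*output.txt)

    # strip SUFFIX_PATTERN PATH...: drop the directory, then the suffix.
    strip() {
        local suffix=$1 f
        shift
        for f in "$@"; do
            f=${f##*/}
            printf '%s\n' "${f%%$suffix}"
        done
    }

    # comm -23 keeps only lines unique to the first (sorted) input.
    comm -23 <(strip '_R1*' "${inputs[@]}" | sort) \
             <(strip '_repeats_output*' "${outputs[@]}" | sort) |
    while IFS= read -r sample; do
        printf '%s/%s_R1_001.fastq.gz\n' "$dir" "$sample"
    done
)

# Demo in a throwaway directory (sample names are made up).
demo=$(mktemp -d)
touch "$demo"/A_S1_L001_R1_001.fastq.gz "$demo"/B_S2_L001_R1_001.fastq.gz
list_todo "$demo"      # results/repeats does not exist yet: both inputs listed
mkdir -p "$demo"/results/repeats
touch "$demo"/results/repeats/A_S1_L001_repeats_output.txt
list_todo "$demo"      # only the B file is listed
rm -rf "$demo"
```

Reading the result line by line, rather than with for i in $TODO, also avoids word splitting and glob expansion on the sample names.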
