Using comm to make a list of files that haven't yet been processed
6 weeks ago

I'm using comm to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information, so they can be compared.

DIR=/Users/michaelflower/Desktop/testing_todo2
TODO=$(comm -3 <(basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a "$DIR"/results/repeats/*output.txt | sed 's/_repeats_output.*//'))

To make a list of files to be processed by my program, I then put the file path and suffix information back in:

for i in $TODO; do echo "$DIR"/${i}"_R1_001.fastq.gz"; done


This is working great, except when the output directory is empty. When there's at least 1 file in the output directory I get the perfect output. But when the output directory is empty (or doesn't yet exist), the first entry in the list is *output.txt. This is messing up the script I'm using these file names in. Any idea how to remove that first entry?

/Users/michaelflower/Desktop/testing_todo2/*output.txt_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF1-JL125CAG-NPC-20210703_S5_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF2-JL125CAG-NPC-20210510_S3_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF3-130CAGiPSC-BL-20210521_S2_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF4-JL180CAG-NPC1-20211211_S4_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz


Using comm to make a list of files that haven't yet been processed

Use a workflow manager such as make, Snakemake, or Nextflow.
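If a workflow manager is more than you need right now, the core idea (only run a step when its output is missing) can be roughly sketched in plain shell. This is an illustration, not what those tools do internally: the `todo_list` function, the temporary directory, and the sample names are made up, and the output name `<sample>_repeats_output.txt` is an assumption inferred from the patterns in the question.

```sh
# Demo data in a throwaway directory (all file names here are hypothetical).
DIR=$(mktemp -d)
mkdir -p "$DIR/results/repeats"
touch "$DIR/sampleA_R1_001.fastq.gz" "$DIR/sampleB_R1_001.fastq.gz"
touch "$DIR/results/repeats/sampleA_repeats_output.txt"

# Print every input file under $1 whose expected output file does not exist yet.
todo_list() {
    for fq in "$1"/*_R1_001.fastq.gz; do
        # If nothing matches, the glob is left as literal text; skip it.
        [ -e "$fq" ] || continue
        sample=$(basename "$fq" _R1_001.fastq.gz)
        # Assumed output naming: <sample>_repeats_output.txt
        out="$1/results/repeats/${sample}_repeats_output.txt"
        [ -e "$out" ] || echo "$fq"
    done
}

todo_list "$DIR"
```

Because each input is checked against its own output, there is nothing to sort and an empty or missing results directory simply means every input is listed.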


This is working great,

This is incorrect usage of comm: both inputs MUST be sorted.
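A minimal sketch of a sorted comparison follows, under a few assumptions. The helper `names`, the temporary directories, and the sample file names are hypothetical. It uses `comm -23` instead of `-3` so that only names missing from the output side are printed. It also skips any glob that matched nothing, which is where the stray `*output.txt` entry comes from (in bash, `shopt -s nullglob` has a similar effect).

```sh
# Demo data in throwaway directories (all file names here are hypothetical).
DIR=$(mktemp -d)
TMP=$(mktemp -d)
mkdir -p "$DIR/results/repeats"
touch "$DIR/sampleB_R1_001.fastq.gz" "$DIR/sampleA_R1_001.fastq.gz"
touch "$DIR/results/repeats/sampleA_repeats_output.txt"

# Print sorted sample names for the given paths.
# $1 is the sed pattern for the suffix to strip; the remaining arguments are paths.
names() {
    suffix_re=$1
    shift
    for f in "$@"; do
        # An unmatched glob is passed through as literal text; skip it.
        [ -e "$f" ] || continue
        basename "$f"
    done | sed "s/$suffix_re//" | LC_ALL=C sort
}

names '_R1.*' "$DIR"/*_R1_001.fastq.gz > "$TMP/inputs.txt"
names '_repeats_output.*' "$DIR"/results/repeats/*output.txt > "$TMP/done.txt"

# comm -23 prints lines found only in the first file: inputs with no output yet.
# LC_ALL=C keeps comm on the same ordering that sort used.
TODO=$(LC_ALL=C comm -23 "$TMP/inputs.txt" "$TMP/done.txt")

for i in $TODO; do
    echo "$DIR/${i}_R1_001.fastq.gz"
done
```

Sorting after the `sed` step matters because stripping suffixes can change the relative order of the names, and `comm` compares the lines as it receives them.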