Removing diff syntax from its output
5 months ago

I'm using diff to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information, so they can be compared.

TODO=$(diff -s <(basename -a ./data/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a ./results/repeats/*output.txt | sed 's/_repeats_output.*//'))
echo $TODO


This outputs

1d0 < MF3-130CAGiPSC-BL-20210521_S2_L001


Which is exactly what I want, except for the 1d0 < bit. I've been looking at the diff manual, and can't see how to get it to output just the filename and not its default syntax (1d0 <). Any help please!

5 months ago

Use comm, not diff.
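
For anyone landing here, a minimal sketch of what that could look like. comm compares two sorted lists and prints three columns (lines only in the first list, lines only in the second, lines in both); -23 hides the second and third columns, so only names missing from the second list remain. The sample names and the all.txt / done.txt files below are made up for illustration; in the question the lists come from basename and sed.

```shell
# Hypothetical sample lists standing in for the basename | sed output.
workdir=$(mktemp -d)
printf '%s\n' SampleA_S1_L001 SampleB_S2_L001 SampleC_S3_L001 | sort > "$workdir/all.txt"
printf '%s\n' SampleB_S2_L001 | sort > "$workdir/done.txt"

# -2 hides lines unique to done.txt, -3 hides lines common to both,
# so only the names still to be processed are printed, with no diff markers.
todo=$(comm -23 "$workdir/all.txt" "$workdir/done.txt")
printf '%s\n' "$todo"

rm -r "$workdir"
```

Both inputs need to be sorted the same way, which is why each list is piped through sort first.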


This is great, and I accepted it, but I've got one other question. I'm now trying to add my filepath and suffix back onto the output and am having difficulty.

echo "$DIR"/${TODO}_R1_001.fastq.gz


produces:

/Users/michaelflower/Desktop/JL_MSH3/MF3-130CAGiPSC-BL-20210521_S2_L001 MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz


But what I want is the filepath and suffix added to each filename! I want to use this as the input to a function, to list the files it needs to operate on.


Not sure where $DIR is coming from, but echo ${DIR}/${TODO}_R1_001.fastq.gz should be all you need. You can capture the directory name using the dirname command.

This might make it clearer. You'll see I'm trying to add "/Users/michaelflower/Desktop/JL_MSH3" before each filename in $TODO and add "_R1_001.fastq.gz" after each.

echo "/Users/michaelflower/Desktop/JL_MSH3"/${TODO}_R1_001.fastq.gz

However, what I get is "/Users/michaelflower/Desktop/JL_MSH3", then all three filenames as a string with spaces in between, then just one "_R1_001.fastq.gz" at the end.

/Users/michaelflower/Desktop/JL_MSH3/MF3-130CAGiPSC-BL-20210521_S2_L001 MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz

How do I get the comm output to be file names, as separate entities, so that we can add the prefix and suffix to each?

I see, so that first part can be fixed by:

echo "$DIR"/${TODO}"_R1_001.fastq.gz"

comm -12 file1 file2 should only print lines that are common to the two files.

I'm afraid I'm still getting the same problem. It seems to interpret the comm output as a single string, rather than separate filenames ...

$ echo "$DIR"/${TODO}"_R1_001.fastq.gz"
/Users/michaelflower/Desktop/testing_todo/MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
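
A rough sketch of what is probably going on here, assuming a Bourne-style shell such as bash. $TODO holds all the names in one string, separated by newlines. When it is expanded unquoted, the shell splits it into words, but the prefix is attached only to the first word and the suffix only to the last. Reading the list one line at a time wraps each name separately. The DIR value and sample names below are placeholders, not the real paths from the question.

```shell
# Hypothetical values standing in for $DIR and the comm output.
DIR=/path/to/project
TODO='SampleA_S1_L001
SampleB_S2_L001'

# Unquoted, $TODO splits into words, but the prefix is glued only to the
# first word and the suffix only to the last one.
joined=$(echo "$DIR"/${TODO}_R1_001.fastq.gz)
printf '%s\n' "$joined"

# Reading the list line by line adds the prefix and suffix to every name.
paths=$(printf '%s\n' "$TODO" | while IFS= read -r name; do
    printf '%s/%s_R1_001.fastq.gz\n' "$DIR" "$name"
done)
printf '%s\n' "$paths"
```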


See the following:

$ cat file1
MF3-130CAGiPSC-BL-20210521_S2_L001
MF4-JL180CAG-NPC1-20211211_S4_L001
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001
$ cat file2
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001
Sample2_S5_L001

# The following prints the file names common to the two files above

$ comm -12 file1 file2
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001
$ for i in $(comm -12 file1 file2); do echo ${i}"_R1_001.fastq.gz"; done
MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
Sample1_S3_L001_R1_001.fastq.gz


My hero, that for loop works perfectly!!


This is working great, except when the output directory is empty. When there's at least 1 file in the output directory I get the perfect output. But when the output directory is empty (or doesn't yet exist), the first entry in the list is *output.txt. This is messing up the script I'm using these file names in. Any idea how to remove that first entry?

TODO=$(comm -3 <(basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a "$DIR"/results/repeats/*output.txt | sed 's/_repeats_output.*//'))
for i in $TODO; do echo "$DIR"/${i}"_R1_001.fastq.gz"; done
/Users/michaelflower/Desktop/testing_todo2/*output.txt_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF1-JL125CAG-NPC-20210703_S5_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF2-JL125CAG-NPC-20210510_S3_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF3-130CAGiPSC-BL-20210521_S2_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF4-JL180CAG-NPC1-20211211_S4_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
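
A possible explanation, with a hedged sketch of one workaround. When a glob such as *output.txt matches nothing, the shell by default passes the pattern through unchanged, so basename prints the literal *output.txt. comm -3 then reports it as a line unique to the second list. Two changes should avoid this: skip glob results that do not exist, and use comm -23 so that only names unique to the first list are printed. The list_done helper and the temporary demo directory are made up for illustration; in bash, shopt -s nullglob is another way to make unmatched globs expand to nothing.

```shell
# list_done DIR: print the sample name for each existing *output.txt file.
# Prints nothing when DIR is empty or missing, because an unmatched glob
# stays a literal pattern and fails the -e test.
list_done() {
    for f in "$1"/*output.txt; do
        [ -e "$f" ] || continue
        basename "$f" | sed 's/_repeats_output.*//'
    done
}

# Hypothetical demo layout: two inputs and no results directory yet.
DIR=$(mktemp -d)
touch "$DIR/SampleA_S1_L001_R1_001.fastq.gz" "$DIR/SampleB_S2_L001_R1_001.fastq.gz"

basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//' | sort > "$DIR/all.txt"
list_done "$DIR/results/repeats" | sort > "$DIR/done.txt"

# -23 keeps only names that appear in the first list and not the second.
TODO=$(comm -23 "$DIR/all.txt" "$DIR/done.txt")
for i in $TODO; do echo "$DIR/${i}_R1_001.fastq.gz"; done

rm -r "$DIR"
```

With no output files present, the loop lists every input file and no stray *output.txt entry.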