Removing diff syntax from its output
2.4 years ago

I'm using diff to work out which files have already been processed and which are still to do. The input and output filenames are a little different, so I've used basename and sed to strip away the filepath and suffix information, so they can be compared.

TODO=$(diff -s <(basename -a ./data/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a ./results/repeats/*output.txt | sed 's/_repeats_output.*//'))
echo $TODO

This outputs

1d0 < MF3-130CAGiPSC-BL-20210521_S2_L001

Which is exactly what I want, except for the 1d0 < bit. I've been looking at the diff manual and can't see how to get it to output just the filename without its default syntax (1d0 <). Any help, please?

diff linux • 1.6k views
2.4 years ago

Use comm, not diff.
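
As a minimal sketch of how comm could stand in for diff here: comm expects both inputs to be sorted, and -23 suppresses the second and third columns, leaving only the lines unique to the first list. The sample names and temporary files below are made up for illustration and stand in for the basename/sed output from the question.

```shell
# Hypothetical sample lists standing in for the basename/sed output.
inputs=$(mktemp)
done_list=$(mktemp)
printf '%s\n' sampleA_S1_L001 sampleB_S2_L001 sampleC_S3_L001 | sort > "$inputs"
printf '%s\n' sampleB_S2_L001 | sort > "$done_list"

# -2 hides lines unique to the second list, -3 hides lines common to both,
# so only the names still to be processed remain, with no diff markers.
TODO=$(comm -23 "$inputs" "$done_list")
printf '%s\n' "$TODO"

rm -f "$inputs" "$done_list"
```

With the process substitutions from the question, the same idea would be comm -23 <(... | sort) <(... | sort).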


This is great, and I accepted it, but I've got one other question. I'm now trying to add my filepath and suffix back onto the output and am having difficulty.

echo "$DIR"/${TODO}_R1_001.fastq.gz

produces:

/Users/michaelflower/Desktop/JL_MSH3/MF3-130CAGiPSC-BL-20210521_S2_L001 MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz

But what I want is the filepath and suffix added to each filename. I want to use this as the input to a function, to list the files it needs to operate on.


Not sure where $DIR is coming from, but echo ${DIR}/${TODO}_R1_001.fastq.gz should be all you need. You can capture the directory name using the dirname command.


This might make it clearer. You'll see I'm trying to add "/Users/michaelflower/Desktop/JL_MSH3" before each filename in $TODO and add "_R1_001.fastq.gz" after each.

echo "/Users/michaelflower/Desktop/JL_MSH3"/${TODO}_R1_001.fastq.gz

However, what I get is "/Users/michaelflower/Desktop/JL_MSH3", then all three filenames as a string with spaces in between, then just one "_R1_001.fastq.gz" at the end.

/Users/michaelflower/Desktop/JL_MSH3/MF3-130CAGiPSC-BL-20210521_S2_L001 MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz

How do I get the comm output to be file names, as separate entities, so that we can add the prefix and suffix to each?


I see, so the first part can be fixed by: echo "$DIR"/${TODO}"_R1_001.fastq.gz"

comm -12 file1 file2 should print only the lines that are common to both files.


I'm afraid I'm still getting the same problem. It seems to interpret the comm output as a single string, rather than separate filenames ...

$ echo "$DIR"/${TODO}"_R1_001.fastq.gz"
/Users/michaelflower/Desktop/testing_todo/MF4-JL180CAG-NPC1-20211211_S4_L001 MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz

See the following:

$ cat file1
MF3-130CAGiPSC-BL-20210521_S2_L001
MF4-JL180CAG-NPC1-20211211_S4_L001
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001

$ cat file2
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001
Sample2_S5_L001

# The following prints the file names common to the two files above

$ comm -12 file1 file2
MF5-JL180CAG-NPCp22-20211211_S1_L001
Sample1_S3_L001

$ for i in $(comm -12 file1 file2); do echo ${i}"_R1_001.fastq.gz"; done
MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
Sample1_S3_L001_R1_001.fastq.gz
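
The loop works where the single echo did not because an unquoted ${TODO} is split into separate words, and any text written around it attaches only to the first and last word. As a sketch under those assumptions, a while read loop does the same job one line at a time without relying on word splitting. DIR and the names in TODO below are placeholder values, not the real paths from the thread.

```shell
# Placeholder values; in the thread these come from the user's directory and comm.
DIR="/path/to/data"
TODO="sampleA_S1_L001
sampleC_S3_L001"

# Read one name per line and wrap each with the directory and suffix.
FILES=$(printf '%s\n' "$TODO" | while IFS= read -r name; do
    # Skip blank lines, such as when TODO is empty.
    [ -n "$name" ] || continue
    printf '%s/%s_R1_001.fastq.gz\n' "$DIR" "$name"
done)
printf '%s\n' "$FILES"
```

Each line of FILES is then one complete path, which can be fed to another loop or function.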

My hero, that for loop works perfectly!!


This is working great, except when the output directory is empty. When there's at least 1 file in the output directory I get the perfect output. But when the output directory is empty (or doesn't yet exist), the first entry in the list is *output.txt. This is messing up the script I'm using these file names in. Any idea how to remove that first entry?

TODO=$(comm -3 <(basename -a "$DIR"/*_R1_001.fastq.gz | sed 's/_R1.*//') <(basename -a "$DIR"/results/repeats/*output.txt | sed 's/_repeats_output.*//'))
for i in $TODO; do echo "$DIR"/${i}"_R1_001.fastq.gz"; done
/Users/michaelflower/Desktop/testing_todo2/*output.txt_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF1-JL125CAG-NPC-20210703_S5_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF2-JL125CAG-NPC-20210510_S3_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF3-130CAGiPSC-BL-20210521_S2_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF4-JL180CAG-NPC1-20211211_S4_L001_R1_001.fastq.gz
/Users/michaelflower/Desktop/testing_todo2/MF5-JL180CAG-NPCp22-20211211_S1_L001_R1_001.fastq.gz
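
The stray *output.txt most likely appears because a glob that matches nothing is passed through unchanged, so basename receives the literal pattern. One possible workaround, sketched here with a hypothetical helper named list_done and a throwaway directory, is to skip any path that does not exist before stripping it. The helper name and the sample file are assumptions for illustration only.

```shell
# list_done is a hypothetical helper: print the sample name for each
# finished output file in the given directory, or nothing if there are none.
list_done() {
    for f in "$1"/*output.txt; do
        [ -e "$f" ] || continue          # unmatched glob stays literal; skip it
        basename "$f" | sed 's/_repeats_output.*//'
    done
}

# Throwaway directory for demonstration only.
demo=$(mktemp -d)
EMPTY=$(list_done "$demo")               # no output files yet

touch "$demo/sampleB_S2_L001_repeats_output.txt"
ONE=$(list_done "$demo")

printf 'empty: [%s]\none: [%s]\n' "$EMPTY" "$ONE"
rm -rf "$demo"
```

The helper also prints nothing when the directory does not exist, so its output could replace the second basename pipeline in the comm call. In bash, shopt -s nullglob is another option, though basename would then be called with no arguments when nothing matches.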