Question

Rename filenames based on list

2.0 years ago

genomes_and_MGEs ▴ 10

Hi everyone, I have a list of filenames stored in file1.txt:

5275_AA_run719_GAGATTCC_S520_L004_R1_001.fastq.gz
5275_A_run720_ATTACTCG_S84_L001_R1_001.fastq.gz
5275_AB_run719_GAGATTCC_S521_L004_R1_001.fastq.gz
5275_B_run720_ATTACTCG_S85_L002_R1_001.fastq.gz

I would like to replace the first two underscore-separated fields of each filename according to the mapping in correspondence.txt:

5275_A  MDF3
5275_B  MDF6
5275_AA MCO6
5275_AB MCO7

If I run

while read n k; do sed -i "s/$n/$k/g" file1.txt ; done < correspondence.txt

the entries are renamed incorrectly, because a short pattern such as 5275_A also matches inside 5275_AA. For example, the

5275_AA_run719_GAGATTCC_S520_L004_R1_001.fastq.gz

file will be renamed to

MDF3A_run719_GAGATTCC_S520_L004_R1_001.fastq.gz

instead of

MCO6_run719_GAGATTCC_S520_L004_R1_001.fastq.gz

Is there a way to fix the above code so that each prefix is matched exactly?

Thank you.

sequence • 526 views

updated 2.0 years ago by Matthias Zepper 4.6k • written 2.0 years ago by genomes_and_MGEs ▴ 10

Perhaps try this:

while read n k; do sed -i "s/${n}_/${k}_/g" file1.txt ; done < correspondence.txt

(Appending a _ to the pattern makes it more specific: 5275_A_ matches only the complete prefix and no longer matches inside 5275_AA_.)
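
A slightly stricter variant also anchors the pattern at the start of the line. The following is only a minimal sketch under a few assumptions: GNU sed (for -i without a suffix), filenames that always begin with the prefix, and a scratch directory from mktemp so no real files are touched. The ^ anchor and the -r flag on read are additions, not part of the command above.

```shell
# Scratch directory for the demonstration (illustrative setup only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two input files from the question.
printf '%s\n' \
  5275_AA_run719_GAGATTCC_S520_L004_R1_001.fastq.gz \
  5275_A_run720_ATTACTCG_S84_L001_R1_001.fastq.gz \
  5275_AB_run719_GAGATTCC_S521_L004_R1_001.fastq.gz \
  5275_B_run720_ATTACTCG_S85_L002_R1_001.fastq.gz > file1.txt
printf '%s\n' '5275_A  MDF3' '5275_B  MDF6' '5275_AA MCO6' '5275_AB MCO7' > correspondence.txt

# ^ pins the match to the start of the line; the trailing _ ends the prefix,
# so 5275_A_ cannot match a line starting with 5275_AA_.
while read -r n k; do
  sed -i "s/^${n}_/${k}_/" file1.txt
done < correspondence.txt

cat file1.txt
```

With the example data, file1.txt should then contain the MCO6_, MDF3_, MCO7_ and MDF6_ names in the original line order. The processing order of correspondence.txt does not matter here, since each anchored pattern can match only its own prefix.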

2.0 years ago by lieven.sterck 15k

If that doesn't suffice, you can additionally sort your correspondence.txt by pattern length, putting the longer patterns first:

awk '{ print length($1), $0 | "sort -n -r" }' < correspondence.txt

7 5275_AB MCO7
7 5275_AA MCO6
6 5275_B  MDF6
6 5275_A  MDF3

If you save this output to a file and read it with

while read m n k; do ...

the longest, and thus hopefully most specific, patterns are replaced first, before the more generic patterns are processed. Here m simply absorbs the length column.
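
One way the pieces might fit together is sketched below. It assumes GNU sed, uses a hypothetical intermediate file named sorted.txt, and works in a scratch directory. It deliberately keeps the unanchored pattern from the question to show that ordering alone is enough for this particular data set.

```shell
# Scratch directory for the demonstration (illustrative setup only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two input files from the question.
printf '%s\n' \
  5275_AA_run719_GAGATTCC_S520_L004_R1_001.fastq.gz \
  5275_A_run720_ATTACTCG_S84_L001_R1_001.fastq.gz \
  5275_AB_run719_GAGATTCC_S521_L004_R1_001.fastq.gz \
  5275_B_run720_ATTACTCG_S85_L002_R1_001.fastq.gz > file1.txt
printf '%s\n' '5275_A  MDF3' '5275_B  MDF6' '5275_AA MCO6' '5275_AB MCO7' > correspondence.txt

# Prefix every row with the length of its pattern and sort longest first.
# sorted.txt is a hypothetical name chosen for this sketch.
awk '{ print length($1), $0 }' correspondence.txt | sort -n -r > sorted.txt

# m = pattern length (ignored), n = old prefix, k = new name.
while read -r m n k; do
  sed -i "s/$n/$k/g" file1.txt
done < sorted.txt

cat file1.txt
```

Because 5275_AA and 5275_AB are replaced before 5275_A is tried, the short pattern no longer finds anything to match in those lines. This still relies on no new name containing one of the remaining old patterns, so combining the sort with the stricter pattern is the safer choice.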

2.0 years ago by Matthias Zepper 4.6k