How to merge folders with similar (but not identical) names into one folder named after the common part of their names
8 weeks ago
Mania • 0

Hello, everybody

I have a list of folders that look like this: 123_S1_R1_001 AND 123_S1_R2_001

456_S25_R1_001 AND 456_S25_R2_001

789_S83_R1_001 AND 789_S83_R2_001

Each folder contains one FASTQ file (e.g. the 123_S1_R1_001 folder contains the file 123_S1_R1_001.fastq.gz).

I would like to merge, for example, the folders 123_S1_R1_001 and 123_S1_R2_001 into a folder named 123_S1, keeping both fastq.gz files in it. The dataset contains many files, so the task cannot be done manually.

Thank you for your time, Mania

folders merge name-based • 791 views

Are you experienced enough to write a bash script?


Unfortunately, I'm kinda new to it.

I've already run the following script; however, it moves each FASTQ file into its own directory named after the file.

for folder in test/ ; do
    echo $folder
    cd $folder
    ls
    for file in *S{$}*.fastq.gz; do
        mkdir -- "${file%.fastq.gz}"
        mv -- "$file" "${file%.fastq.gz}"
    done
    cd ../
done

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!


Thank you.


Before

$ tree .
.
├── 123_S1_R1_001
│   └── 123_S1_R1_001.fastq.gz
├── 123_S1_R2_001
│   └── 123_S1_R2_001.fastq.gz
├── 456_S25_R1_001
│   └── 456_S25_R1_001.fastq.gz
└── 456_S25_R2_001
    └── 456_S25_R2_001.fastq.gz

4 directories, 4 files

In bash or zsh:

$ find . -type f -name "*_R1_*" | while read -r line; do mkdir -p "${line%%_R1_*}"; cp "$line" "${line//_R1_/_R2_}" "${line%%_R1_*}/"; done

After

$ tree .
.
├── 123_S1
│   ├── 123_S1_R1_001.fastq.gz
│   └── 123_S1_R2_001.fastq.gz
├── 123_S1_R1_001
│   └── 123_S1_R1_001.fastq.gz
├── 123_S1_R2_001
│   └── 123_S1_R2_001.fastq.gz
├── 456_S25
│   ├── 456_S25_R1_001.fastq.gz
│   └── 456_S25_R2_001.fastq.gz
├── 456_S25_R1_001
│   └── 456_S25_R1_001.fastq.gz
└── 456_S25_R2_001
    └── 456_S25_R2_001.fastq.gz

6 directories, 8 files

With GNU parallel:

$ find . -type f -name "*_R1_*" | parallel --plus --dry-run  'mkdir -p {=s/_R1.*//g=} &&  mv {} {=s/R1/R2/g=} {=s/_R1.*//g=}/'

Remove --dry-run once you have checked the dry-run output. Note that this version moves the files (mv), whereas the loop above copies them (cp).
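For anyone who prefers something easier to read than a one-liner, here is a rough sketch of the same idea as a small shell function. The name merge_pairs is made up for illustration, and the sketch assumes every directory is named <sample>_R1_<suffix> or <sample>_R2_<suffix> and holds at least one *.fastq.gz file. It loops over the R1 directories with a glob instead of find, so the newly created sample directories are never picked up by the loop.

```shell
#!/usr/bin/env bash
# merge_pairs is a hypothetical helper name, not part of any tool.
# Assumption: directories are named <sample>_R1_<suffix> and
# <sample>_R2_<suffix>, and each holds at least one *.fastq.gz file.
merge_pairs() (
    root=$1
    for r1_dir in "$root"/*_R1_*/; do
        [ -d "$r1_dir" ] || continue             # glob matched nothing
        r1_dir=${r1_dir%/}                       # strip trailing slash
        name=${r1_dir##*/}                       # e.g. 123_S1_R1_001
        sample=${name%%_R1_*}                    # e.g. 123_S1
        r2_dir=$root/${sample}_R2_${name#*_R1_}  # e.g. <root>/123_S1_R2_001
        if [ ! -d "$r2_dir" ]; then
            echo "skipping $name: no matching R2 directory" >&2
            continue
        fi
        mkdir -p "$root/$sample"
        # Copy rather than move, so the original directories stay intact.
        cp -- "$r1_dir"/*.fastq.gz "$r2_dir"/*.fastq.gz "$root/$sample"/
    done
)

# Usage (from the directory that holds the sample folders):
#   merge_pairs .
```

Once the result looks right, cp can be swapped for mv and the emptied directories removed with rmdir.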

8 weeks ago
Dunois ★ 1.5k

Here's a one-liner in bash. Assuming you have all these directories under fastq_dirs, execute this command from the parent directory of fastq_dirs. It'll create a directory named new_fastq_dirs at that location with the FASTQ files arranged as you've described.

for DIR in fastq_dirs/*; do BNAME=$(basename "$DIR"); NDIR=$(echo "$BNAME" | grep -oP "^[0-9]+_[A-Z]+[0-9]+"); echo "Creating " "$NDIR" " and moving files from " "$BNAME"; mkdir -p "new_fastq_dirs/${NDIR}"; cp "${DIR}"/*.fastq.gz "new_fastq_dirs/${NDIR}"; done

The command should also print something like this to the terminal during its execution:

Creating  123_S1  and moving files from  123_S1_R1_001
Creating  123_S1  and moving files from  123_S1_R2_001
Creating  456_S25  and moving files from  456_S25_R1_001
Creating  456_S25  and moving files from  456_S25_R2_001
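The prefix extraction relies on grep -P, which is a GNU grep option and may not be available elsewhere (for example with the BSD grep on macOS). As a rough alternative, the same pattern can be applied with sed -E. The helper name sample_prefix below is hypothetical; it only previews which target directory each source directory would map to and does not touch any files.

```shell
# sample_prefix is a hypothetical helper: it prints the leading
# <digits>_<LETTERS><digits> part of a name, or nothing if there is no match.
sample_prefix() {
    printf '%s\n' "$1" | sed -nE 's/^([0-9]+_[A-Z]+[0-9]+).*/\1/p'
}

# Preview the mapping for a few of the directory names from the question.
for name in 123_S1_R1_001 123_S1_R2_001 456_S25_R1_001; do
    printf '%s -> %s\n' "$name" "$(sample_prefix "$name")"
done
```

Both 123_S1_R1_001 and 123_S1_R2_001 should map to 123_S1, which is why the R1 and R2 files end up in the same directory.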

Nice one. However, won't that throw a warning/error when it tries to create a directory that already exists?


It wouldn't because mkdir is being invoked with the -p switch.

From man mkdir:

-p, --parents
              no error if existing, make parent directories as needed
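This is easy to confirm in a throwaway directory. The sketch below (variable names are purely illustrative) calls mkdir -p twice on the same path and then plain mkdir once, recording the exit status each time.

```shell
tmp=$(mktemp -d)                        # scratch directory for the demo
target="$tmp/new_fastq_dirs/123_S1"

mkdir -p "$target" && first_status=0 || first_status=$?
mkdir -p "$target" && second_status=0 || second_status=$?        # already exists
mkdir "$target" 2>/dev/null && plain_status=0 || plain_status=$? # no -p

echo "mkdir -p, first call:  $first_status"
echo "mkdir -p, second call: $second_status"
echo "mkdir without -p:      $plain_status"
rm -rf "$tmp"
```

The two -p calls report 0, while the plain mkdir on the existing directory reports a non-zero status.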

indeed, so it is. #TIL :)

I thought -p only suppressed complaints about existing parent directories, but apparently it also covers the directory being created itself.


Thank you for your input Dunois!

I have run the command you proposed; however, what it basically created is a copy of the directories (FASTQ files included) present in the fastq_dirs directory. The R1 and R2 FASTQ files are still in separate directories (e.g. 123_S1_R1_001 and 123_S1_R2_001), not in the same directory named e.g. 123_S1.


Could you please share the exact command that you happened to execute? What you described shouldn't have happened.

Also, I must apologize. There was a typo in the code I shared with you (newdirs instead of new_fastq_dirs). I've fixed this now.
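To check the result after a run, something along these lines may help. It is only a sketch: check_pairs is a hypothetical helper, and it assumes each merged directory should hold exactly one *_R1_*.fastq.gz and one *_R2_*.fastq.gz file. It lists every subdirectory that does not meet that expectation.

```shell
# check_pairs is a hypothetical helper: it reports every subdirectory of "$1"
# that does not contain exactly one R1 and one R2 FASTQ file.
check_pairs() (
    status=0
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue       # glob matched nothing
        r1=$(find "$dir" -maxdepth 1 -type f -name '*_R1_*.fastq.gz' | wc -l | tr -d ' ')
        r2=$(find "$dir" -maxdepth 1 -type f -name '*_R2_*.fastq.gz' | wc -l | tr -d ' ')
        if [ "$r1" -ne 1 ] || [ "$r2" -ne 1 ]; then
            echo "unpaired: ${dir%/} (R1=$r1, R2=$r2)"
            status=1
        fi
    done
    exit "$status"                      # non-zero if anything was unpaired
)

# Usage:
#   check_pairs new_fastq_dirs
```

If the directories were merged as intended, it prints nothing and returns 0.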
