Writing a for loop with an if statement to concatenate files with the same name ending in Linux
7 months ago
arnoldhaley3 ▴ 10

I have a directory of paired-end fastq files, some of which are from the same individuals but extracted from different DNA samples, and I would like to concatenate these pairs. It would honestly be faster for me to do this manually pair by pair, but I am trying to improve my bash/shell scripting skills and I'll need to be able to do this in the future.

The files follow this pattern:

Kam_L_39.1.fq
Kam_L_39.2.fq
Kam_L_48.1.fq
Kam_L_48.2.fq
Kam_T_39.1.fq
Kam_T_39.2.fq
Kam_T_48.1.fq
Kam_T_48.2.fq

I want to concatenate the T and L files for each sample number (the first number after the underscores) with the same read number, so for example concatenate Kam_L_39.1.fq with Kam_T_39.1.fq and Kam_L_39.2.fq with Kam_T_39.2.fq, and the same for sample 48. This directory also contains T file pairs that do not have a matching L pair; I don't need to worry about those.

I think this would require a conditional for loop, something like: if the end of the file name (_sample.read.fq) has a match, then concatenate the files, and repeat for all files in the directory. If possible, I would like to keep the original L and T files and name the concatenated files something like Kam_LT_39.1.fq and Kam_LT_39.2.fq.

I did find similar questions, but I'm not sure how to account for the extra variability in my naming convention, and I tried to modify code from other questions with no success. Any help is very much appreciated as I am new to for loops, thank you!

shell linux bash
7 months ago
rfran010 ▴ 900

I personally like to use parameter expansion for these tasks. I'm not sure if it's a best practice though.

Also, as a note: whenever performing this type of task, it's best to check that the parameter expansion will work as expected by doing a dry run with echo. This prints the commands instead of running them, so you can check that they do what you intend. Here is an example:

for number in 39 48
do
  echo "cat Kam_L_${number}.1.fq  Kam_T_${number}.1.fq >  Kam_LT_${number}.1.fq"
  echo "cat Kam_L_${number}.2.fq  Kam_T_${number}.2.fq >  Kam_LT_${number}.2.fq"
done

Then you can observe the commands, make sure they make sense, and then run them by removing echo:

for number in 39 48
do
  cat Kam_L_${number}.1.fq  Kam_T_${number}.1.fq >  Kam_LT_${number}.1.fq
  cat Kam_L_${number}.2.fq  Kam_T_${number}.2.fq >  Kam_LT_${number}.2.fq
done

I like to explicitly specify as much of the file name as possible instead of using wildcard characters, to reduce mistakes. For example, cat Kam_*_${number}.1.fq > Kam_LT_${number}.1.fq may also work, but the wildcard may allow additional files to be included.

If your other samples with "T"-only files don't have the same numbers, then you should be fine to run it in the same directory. A quick alternative is to move the files you want to work with into a new directory.

For future use, another key point is how to specify your for list: 39 48 is easy to type at the command line, but if you have many different numbers or strings, you could list them in a file and then use for number in $(cat list_file).
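
As a rough sketch of that idea (assuming the sample numbers sit one per line in a hypothetical file called sample_numbers.txt):

# dry run first: print the commands, then remove echo once they look right
for number in $(cat sample_numbers.txt)
do
  echo "cat Kam_L_${number}.1.fq  Kam_T_${number}.1.fq >  Kam_LT_${number}.1.fq"
  echo "cat Kam_L_${number}.2.fq  Kam_T_${number}.2.fq >  Kam_LT_${number}.2.fq"
done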

The for list also works with filenames and wildcards:

for file in Kam_L_??.?.fq
do
  # ${file/_L_/_T_} replaces _L_ with _T_ in the filename; _LT_ works the same way
  cat ${file} ${file/_L_/_T_} > ${file/_L_/_LT_}
done

However, I think this example leaves more room for error. It does highlight one of the fun parts of parameter expansion, though: you can specify substitutions on the fly, so here we substitute _L_ with _T_ or _LT_ depending on our needs.
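
If you want the conditional check from your original idea, one rough sketch (assuming the same naming pattern) is to only run cat when the matching T file actually exists:

for file in Kam_L_??.?.fq
do
  # skip any L file whose T counterpart is missing
  if [[ -f "${file/_L_/_T_}" ]]
  then
    cat "${file}" "${file/_L_/_T_}" > "${file/_L_/_LT_}"
  fi
done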

A long answer for a relatively simple operation, but there is a very large amount of flexibility without even using sed or awk.


fantastic explanation, thank you so much for your kind help!

