Combine files that have partial filename matches
15 months ago

Hi there,

I have multiple sequencing reads with the following structure (the below is just an example):

> MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R1_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R1_001.fastq.gz
> MSP3_run719_TCATCCTA_S65_L004_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L001_R2_001.fastq.gz
> MSP3_run720_TCATCCTA_S45_L002_R2_001.fastq.gz

My goal is to combine reads that have the same names in columns 1 and 6 (delimiter _), and have an output as follows:

> MSP3_R1.fastq.gz
> MSP3_R2.fastq.gz

I tried to run the following command, but it didn't work:

for file in MSP3_*_R1_*.fastq.gz ; do cat "$file" >>"${file%_*}.comb" ; done

Can someone help me out? Thanks!


I think you just need ${file%%_*}, with two %, to get the "MSP3" part:

for file in MSP3_*_R1_*.fastq.gz ; do cat "$file" >>"${file%%_*}_R1.comb" ; done

(Or whatever file extension you want.)

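With one %, the shell strips the shortest suffix matching the pattern (so everything from the last _ onward), while with two it strips the longest (everything from the first _ onward). Same idea in the other direction for # and ##, which strip a matching prefix.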

$ x="one_two_three"
$ echo ${x%_*}
one_two
$ echo ${x%%_*}
one
$ echo ${x#*_}
two_three
$ echo ${x##*_}
three
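
Applied to one of the file names from the question, the same expansions can pull out both the sample name and the R1/R2 field. This is only a sketch: the variable names sample, rest and read_tag are made up for illustration, and it assumes the read field is always the second-to-last underscore-separated field.

```shell
file="MSP3_run719_TCATCCTA_S65_L004_R1_001.fastq.gz"

sample="${file%%_*}"      # strip longest suffix matching _*   -> MSP3
rest="${file%_*}"         # strip shortest suffix matching _*  -> drops _001.fastq.gz
read_tag="${rest##*_}"    # strip longest prefix matching *_   -> R1

echo "${sample}_${read_tag}.fastq.gz"
```

This prints MSP3_R1.fastq.gz.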

To handle R1/R2 automatically, you'd probably just use some other string manipulation commands. (I end up abusing cut a lot for that sort of thing.) Just be careful, since that can be a brittle way of going about it: it breaks as soon as a file name has a different number of underscore-separated fields. Like maybe:

for file in MSP3_*_R[12]_*.fastq.gz ; do cat "$file" >> "$(echo "$file" | cut -f 1,6 -d _).comb"; done
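
One thing to watch with any >> loop is that running it a second time appends everything again. Here is a rough sketch of a slightly more defensive version. The function name combine_reads and its two directory arguments are hypothetical, and it assumes every input file follows the seven-field naming shown in the question.

```shell
# Hypothetical helper: combine_reads <input_dir> <output_dir>
combine_reads() {
    local in_dir="$1" out_dir="$2" file base sample read_tag
    mkdir -p "$out_dir"
    # Remove earlier outputs so that a rerun does not append duplicates.
    rm -f "$out_dir"/*_R[12].fastq.gz
    for file in "$in_dir"/*_R[12]_*.fastq.gz; do
        [ -e "$file" ] || continue      # the glob matched nothing
        base="${file##*/}"              # strip the directory part
        sample="${base%%_*}"            # field 1, e.g. MSP3
        read_tag="${base%_*}"           # drop the trailing _001.fastq.gz
        read_tag="${read_tag##*_}"      # field 6, R1 or R2
        cat "$file" >> "$out_dir/${sample}_${read_tag}.fastq.gz"
    done
}
```

Something like combine_reads . combined would then write combined/MSP3_R1.fastq.gz and combined/MSP3_R2.fastq.gz. Because the shell expands the glob in sorted order, R1 and R2 files should be concatenated in the same order as long as the names of each pair differ only in the R1/R2 field, which matters if the reads are paired.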
15 months ago

Hi

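 # Restrict the glob to the raw read files so earlier combined output is not picked up again
 for file in *_R[12]_*.fastq.gz; do
     # Fields 1 and 6 (split on "_") are the sample name and the read (R1/R2)
     newFile=$(echo "$file" | awk -F'_' '{print $1"_"$6}')
     cat "$file" >> "$newFile.fastq.gz"
 done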
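
Concatenated gzip files are still valid gzip, so the combined files can be read directly. As a quick sanity check afterwards, something like the following could be used to compare read counts between the R1 and R2 outputs. The function name count_reads is made up for this sketch, and it assumes standard four-line FASTQ records.

```shell
# Hypothetical helper: print the number of reads in a gzipped FASTQ file.
count_reads() {
    local lines
    # FASTQ stores each read on four lines.
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}
```

Running count_reads MSP3_R1.fastq.gz and count_reads MSP3_R2.fastq.gz should give the same number for paired data.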