Question

How to cat fastq files but keep different samples separate

0

Entering edit mode

4.0 years ago

m.radz ▴ 10

I'm trying to find a fast way to cat multiple fastq files in a directory, but keep the output merged files separate for each samples. For example, if I have 2 samples split across multiple lanes in a directory:

Sample1_C_S4_L001_R1_001.fastq.gz Sample1_C_S4_L001_R2_001.fastq.gz Sample1_C_S4_L002_R1_001.fastq.gz Sample1_C_S4_L002_R2_001.fastq.gz Sample1_L_S4_L001_R1_001.fastq.gz Sample1_L_S4_L001_R2_001.fastq.gz Sample1_L_S4_L002_R1_001.fastq.gz Sample1_L_S4_L002_R2_001.fastq.gz

I've tried something like: for f in Sample1_*.fastq.gz; do cat $f > $f-merged.fastq.gz; done but I can't get it to work, the output I want is two files named Sample_1_C-merged.fastq.gz and Sample_1_L-merged.fastq.gz. I don't need to keep the R1 and R2 reads separate

next-gen sequencing dna sequence • 2.1k views

ADD COMMENT • link updated 4.0 years ago by ATpoint 81k • written 4.0 years ago by m.radz ▴ 10

0

Entering edit mode

Can you rerun demultiplexer and not split by lane.?

ADD REPLY • link 4.0 years ago by cpad0112 21k

score 2 · Answer 1 · 2020-04-03

2

Entering edit mode

4.0 years ago

Prakash ★ 2.2k

$f used in output will create output for each line.

you need to try :

for f in Sample1_C*.fastq.gz; do cat $f > Sample1_C_merged.fastq.gz; done
for f in Sample1_L*.fastq.gz; do cat $f > Sample1_L_merged.fastq.gz; done

ADD COMMENT • link 4.0 years ago by Prakash ★ 2.2k

0

Entering edit mode

Can I do that in a single command? Or does it have to be done separately?

ADD REPLY • link 4.0 years ago by m.radz ▴ 10

0

Entering edit mode

Can be in one loop, see my answer below and the concerns I have with both your strategy.

ADD REPLY • link 4.0 years ago by ATpoint 81k

score 1 · Answer 2 · 2020-04-03

1

Entering edit mode

4.0 years ago

ATpoint 81k

Careful, the solutions from both Prakash and swbarnes2 merge the paired-end data into a single file which is probably not what should be happening even though OP does not explicitely says so. I am reasonaby sure you want forward and reverse read being separated for proper alignment. @OP, this is paired end data so combining both into one file would double-count each sequenced fragment since every of the two paired-end reads comes from the same piece of DNA. Are you aware of that? Better do:

for f in Sample*.fastq.gz
  do
  Basename=${f%_L00*}
  ## merge R1
  ls ${Basename}_L00*_R1_001.fastq.gz | xargs cat > ${Basename}_merged_R1.fastq.gz
  ## merge R2
  ls ${Basename}_L00*_R2_001.fastq.gz | xargs cat > ${Basename}_merged_R2.fastq.gz
  done

ADD COMMENT • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

I thought that was what I wanted. But I confirmed with the sequencing centre that there is little to no overlap between the forward and reverse reads as the bulk of each library has fragments with inserts >300bp. Also I will be using these reads for metagenomic analysis (metaphlan2) which doesn't require merged reads.

Or am I going about this the completely wrong way, should I be keeping the forward and reverse reads separate during the trimming and filtering step?

ADD REPLY • link 4.0 years ago by m.radz ▴ 10

0

Entering edit mode

I am not into metagenomics so I do not kow for sure but for most other analysis like RNA-seq, ChIP, ATAC etc you always want to keep them separated. In your specific case it might also work that way but as said I do not know how the metagenomics field works. Others might comment on that. What could be a problem is that when you do any aalysis with tools such as Picard (e.g. to get rate of duplicates) it might be messed up since you have two reads with the same name not being flagged as paired. I'd keep them separated as this is how paired-end sequencing is typically interpreted by most tools.

ADD REPLY • link 4.0 years ago by ATpoint 81k

0

Entering edit mode

If you don't need to merge your reads to run metaphlan (I can't remember what form of data it expects to ingest), then don't.

If you need all of the input reads in a single file, then cat them together with the suggestions above. It's not clear to me what the difference between your L and C samples are, so its hard to say whether it makes sense to 'pool' them or not, since apparently everything is Sample1.

ADD REPLY • link 4.0 years ago by Joe 21k

0

Entering edit mode

Or am I going about this the completely wrong way, should I be keeping the forward and reverse reads separate during the trimming and filtering step?

If your reads don't overlap (sounds like they mostly don't) then you should keep them separate. Some programs may expect interleaved data (Read1_R1, Read1_R2, Read2_R1, Read2_R2 etc). In that case you can interleave the data using reformat.sh from BBMap suite.

ADD REPLY • link 4.0 years ago by GenoMax 141k

0

Entering edit mode

I agree with you ATpoint, paired-end data R1 and R2 should be kept separate. Thanks for highlighting this point. Few things are still no clear to me such as

no overlap between forward and reverse reads

merged reads are not required

also as Joe highlighted

difference between L and C samples

I hope @m.radz would be able to choose better.

ADD REPLY • link 4.0 years ago by Prakash ★ 2.2k

score 0 · Answer 3 · 2020-04-03

0

Entering edit mode

4.0 years ago

swbarnes2 14k

You don't want a loop.

cat Sample1_C*fastq.gz > Sample1_C.fastq.gz

ADD COMMENT • link 4.0 years ago by swbarnes2 14k

0

Entering edit mode

That will cat all the "C" files but not the "L". I know how to do them separately, I was hoping I could write a command that would cat the "C" files then move on to the "L" files automatically.

ADD REPLY • link 4.0 years ago by m.radz ▴ 10