Question

Clumpify Input via Process Substitution

1

Entering edit mode

6.4 years ago

dario.garvan ▴ 520

How can I input reads from a file descriptor rather than an actual text file? I have created four small test FASTQ files to identify the problem; two R1 files and two R2 files. If I use process substitution, Clumpify fails. If I firstly combine R1 reads into a text file on disk and R2 reads into another file and use them, then Clumpify works. The reason I am exploring this is because I have a data set of 51 whole genome DNA samples. 47 of them were sequenced in one lane and four of them were sequenced in two lanes. The command I used is

clumpify.sh -Xmx1g t=1 in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

This results in the error

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1

The only difference for the command which works successfully is in1=merged1.fastq and in2=merged2.fastq

Done!
Time:                           0.971 seconds.
Reads Processed:          4000  4.12k reads/sec
Bases Processed:          600k  0.62m bases/sec

I got this idea from the STAR RNA-seq aligner, which permits --readFilesIn <(gunzip -c reads.fastq.gz), for example.

On the university's computing cluster, version 37.98 of bbmap is installed. I could use the /scratch/ directory if all else fails.

bbmap clumpify • 2.9k views

ADD COMMENT • link 6.4 years ago by dario.garvan ▴ 520

0

Entering edit mode

Have you tried to do this using the java command directly instead of the shell wrapper (the .sh version you are using)?

ADD REPLY • link 6.4 years ago by GenoMax 152k

0

Entering edit mode

After your suggestion, I did. However, the error remains the same.

$ java -ea -cp /usr/local/bbmap/37.98/current/ clump.Clumpify in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

again produces

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1

ADD REPLY • link 6.4 years ago by dario.garvan ▴ 520

0

Entering edit mode

I suspect that clumpify is reopening the files internally, which won't work with a named pipe.

ADD REPLY • link 6.4 years ago by Devon Ryan 105k

0

Entering edit mode

I don't really understand why you want to do so? There is no reason to use cat in that syntax.

If there are simply files names in ${R1[@]} , then simply use them as value for in1 , even if they are gzipped (since BBmap accepts gzip files as in put as well).

ADD REPLY • link 6.4 years ago by lieven.sterck 15k

0

Entering edit mode

It looks like OP is trying to merge several files on the fly and pass them to clumpify.

Try using the syntax “${R1[*]}” instead (including quotes).

ADD REPLY • link 6.4 years ago by Joe 22k

2

Entering edit mode

6.4 years ago

ATpoint 88k

I am not sure how these arrays of yours look and how your files are named but the basic idea given you had a file that contains the file names to be merged would be to exploit the interleaved option of bbmap. First interleave the files with seqtk and them stdin them to clumpify:

seqtk mergepe \
  <(cat file.txt | awk '{print $1"_1.fastq.gz"}' | xargs cat) \
  <(cat file.txt | awk '{print $1"_2.fastq.gz"}' | xargs cat) | \
  clumpify.sh in1=stdin.fastq interleaved=true out=interleaved_out.fastq (your_additional_options)

Should be possible to modify it to work with your arrays. The advantage of interleaved files is that you can stream them directly into the aligner saving time and disk space especially when working with large files. BWA mem supports interleaved input with the -p option. Would then be something like:

(above command) out=stdout.fastq | bwa mem (opts...) -p /dev/stdin | samtools view -o out.bam

ADD COMMENT • link 6.4 years ago by ATpoint 88k

0

Entering edit mode

It's a good answer, but adds an additional software to the pipeline, increasing the future maintenance.

ADD REPLY • link 6.4 years ago by dario.garvan ▴ 520

1

Entering edit mode

6.4 years ago

dario.garvan ▴ 520

If there's a lot of extra space on the computer's scratch directory, the easiest to read code would be.

cat ${R1[@]} > /scratch/projectID/Patient1_merged_R1.fastq.gz
cat ${R2[@]} > /scratch/projectID/Patient1_merged_R2.fastq.gz
clumpify.sh -Xmx1g t=1 in1=/scratch/projectID/Patient1_merged_R1.fastq.gz in2=/scratch/projectID/Patient1_merged_R2.fastq.gz out1=$R1output out2=$R2output dedupe subs=1

ADD COMMENT • link 6.4 years ago by dario.garvan ▴ 520

0

Entering edit mode

You are using just a gig of RAM and one thread for this clumpify job?

ADD REPLY • link 6.4 years ago by GenoMax 152k

0

Entering edit mode

Yes, because I made a reduced example which has 1000 reads in each file to test out different approaches. I am using 512 GB RAM and multiple threads on actual data.

ADD REPLY • link 6.4 years ago by dario.garvan ▴ 520

score 3 · Accepted Answer · 2019-03-12

3

Entering edit mode

6.4 years ago

h.mon 35k

I think the problem is clumpify.sh won't work with process substitution, regardless of the arrays. Here is a solution similar to ATpoint but using arrays, and interleaving with paste and awk taken shamelessly from this gist:

paste <(zcat ${r1[@]}) <(zcat ${r2[@]}) \
  | paste - - - - \
  | awk -v OFS="\n" -v FS="\t" '{print($1,$3,$5,$7,$2,$4,$6,$8)}' \
  | clumpify.sh -Xmx1g t=1 int=t in=stdin.fastq out1=$R1output out2=$R2output dedupe subs=1

ADD COMMENT • link 6.4 years ago by h.mon 35k

0

Entering edit mode

I've chosen this as the best answer because it doesn't create extra dependencies.