Question: Clumpify Input via Process Substitution
1
gravatar for dario.garvan
5 weeks ago by
dario.garvan440
Australia
dario.garvan440 wrote:

How can I input reads from a file descriptor rather than an actual text file? I have created four small test FASTQ files to identify the problem; two R1 files and two R2 files. If I use process substitution, Clumpify fails. If I firstly combine R1 reads into a text file on disk and R2 reads into another file and use them, then Clumpify works. The reason I am exploring this is because I have a data set of 51 whole genome DNA samples. 47 of them were sequenced in one lane and four of them were sequenced in two lanes. The command I used is

clumpify.sh -Xmx1g t=1 in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

This results in the error

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1

The only difference for the command which works successfully is in1=merged1.fastq and in2=merged2.fastq

Done!
Time:                           0.971 seconds.
Reads Processed:          4000  4.12k reads/sec
Bases Processed:          600k  0.62m bases/sec

I got this idea from the STAR RNA-seq aligner, which permits --readFilesIn <(gunzip -c reads.fastq.gz), for example.

On the university's computing cluster, version 37.98 of bbmap is installed. I could use the /scratch/ directory if all else fails.

bbmap clumpify • 236 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by dario.garvan440

Have you tried to do this using the java command directly instead of the shell wrapper (the .sh version you are using)?

ADD REPLYlink written 5 weeks ago by genomax65k

After your suggestion, I did. However, the error remains the same.

$ java -ea -cp /usr/local/bbmap/37.98/current/ clump.Clumpify in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

again produces

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
ADD REPLYlink written 5 weeks ago by dario.garvan440

I suspect that clumpify is reopening the files internally, which won't work with a named pipe.

ADD REPLYlink written 5 weeks ago by Devon Ryan89k

I don't really understand why you want to do so? There is no reason to use cat in that syntax.

If there are simply files names in ${R1[@]} , then simply use them as value for in1 , even if they are gzipped (since BBmap accepts gzip files as in put as well).

ADD REPLYlink written 5 weeks ago by lieven.sterck4.5k

It looks like OP is trying to merge several files on the fly and pass them to clumpify.

Try using the syntax “${R1[*]}” instead (including quotes).

ADD REPLYlink written 5 weeks ago by jrj.healey12k
3
gravatar for h.mon
5 weeks ago by
h.mon24k
Brazil
h.mon24k wrote:

I think the problem is clumpify.sh won't work with process substitution, regardless of the arrays. Here is a solution similar to ATpoint but using arrays, and interleaving with paste and awk taken shamelessly from this gist:

paste <(zcat ${r1[@]}) <(zcat ${r2[@]}) \
  | paste - - - - \
  | awk -v OFS="\n" -v FS="\t" '{print($1,$3,$5,$7,$2,$4,$6,$8)}' \
  | clumpify.sh -Xmx1g t=1 int=t in=stdin.fastq out1=$R1output out2=$R2output dedupe subs=1
ADD COMMENTlink written 5 weeks ago by h.mon24k

I've chosen this as the best answer because it doesn't create extra dependencies.

ADD REPLYlink written 5 weeks ago by dario.garvan440

You can accept more than one if you'd like.

ADD REPLYlink written 5 weeks ago by jrj.healey12k
2
gravatar for ATpoint
5 weeks ago by
ATpoint15k
Germany
ATpoint15k wrote:

I am not sure how these arrays of yours look and how your files are named but the basic idea given you had a file that contains the file names to be merged would be to exploit the interleaved option of bbmap. First interleave the files with seqtk and them stdin them to clumpify:

seqtk mergepe \
  <(cat file.txt | awk '{print $1"_1.fastq.gz"}' | xargs cat) \
  <(cat file.txt | awk '{print $1"_2.fastq.gz"}' | xargs cat) | \
  clumpify.sh in1=stdin.fastq interleaved=true out=interleaved_out.fastq (your_additional_options)

Should be possible to modify it to work with your arrays. The advantage of interleaved files is that you can stream them directly into the aligner saving time and disk space especially when working with large files. BWA mem supports interleaved input with the -p option. Would then be something like:

(above command) out=stdout.fastq | bwa mem (opts...) -p /dev/stdin | samtools view -o out.bam
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by ATpoint15k

It's a good answer, but adds an additional software to the pipeline, increasing the future maintenance.

ADD REPLYlink written 5 weeks ago by dario.garvan440
0
gravatar for dario.garvan
5 weeks ago by
dario.garvan440
Australia
dario.garvan440 wrote:

If there's a lot of extra space on the computer's scratch directory, the easiest to read code would be.

cat ${R1[@]} > /scratch/projectID/Patient1_merged_R1.fastq.gz
cat ${R2[@]} > /scratch/projectID/Patient1_merged_R2.fastq.gz
clumpify.sh -Xmx1g t=1 in1=/scratch/projectID/Patient1_merged_R1.fastq.gz in2=/scratch/projectID/Patient1_merged_R2.fastq.gz out1=$R1output out2=$R2output dedupe subs=1
ADD COMMENTlink written 5 weeks ago by dario.garvan440

You are using just a gig of RAM and one thread for this clumpify job?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax65k

Yes, because I made a reduced example which has 1000 reads in each file to test out different approaches. I am using 512 GB RAM and multiple threads on actual data.

ADD REPLYlink written 5 weeks ago by dario.garvan440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1584 users visited in the last hour