Question: Clumpify Input via Process Substitution
1
gravatar for dario.garvan
5 months ago by
dario.garvan460
Australia
dario.garvan460 wrote:

How can I input reads from a file descriptor rather than an actual text file? I have created four small test FASTQ files to identify the problem; two R1 files and two R2 files. If I use process substitution, Clumpify fails. If I firstly combine R1 reads into a text file on disk and R2 reads into another file and use them, then Clumpify works. The reason I am exploring this is because I have a data set of 51 whole genome DNA samples. 47 of them were sequenced in one lane and four of them were sequenced in two lanes. The command I used is

clumpify.sh -Xmx1g t=1 in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

This results in the error

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1

The only difference for the command which works successfully is in1=merged1.fastq and in2=merged2.fastq

Done!
Time:                           0.971 seconds.
Reads Processed:          4000  4.12k reads/sec
Bases Processed:          600k  0.62m bases/sec

I got this idea from the STAR RNA-seq aligner, which permits --readFilesIn <(gunzip -c reads.fastq.gz), for example.

On the university's computing cluster, version 37.98 of bbmap is installed. I could use the /scratch/ directory if all else fails.

bbmap clumpify • 340 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by dario.garvan460

Have you tried to do this using the java command directly instead of the shell wrapper (the .sh version you are using)?

ADD REPLYlink written 5 months ago by genomax70k

After your suggestion, I did. However, the error remains the same.

$ java -ea -cp /usr/local/bbmap/37.98/current/ clump.Clumpify in1=<(cat ${R1[@]}) in2=<(cat ${R2[@]}) out1=$R1output out2=$R2output dedupe subs=1

again produces

Starting cris 0.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
ADD REPLYlink written 5 months ago by dario.garvan460

I suspect that clumpify is reopening the files internally, which won't work with a named pipe.

ADD REPLYlink written 5 months ago by Devon Ryan91k

I don't really understand why you want to do so? There is no reason to use cat in that syntax.

If there are simply files names in ${R1[@]} , then simply use them as value for in1 , even if they are gzipped (since BBmap accepts gzip files as in put as well).

ADD REPLYlink written 5 months ago by lieven.sterck5.6k

It looks like OP is trying to merge several files on the fly and pass them to clumpify.

Try using the syntax “${R1[*]}” instead (including quotes).

ADD REPLYlink written 5 months ago by jrj.healey13k
3
gravatar for h.mon
5 months ago by
h.mon27k
Brazil
h.mon27k wrote:

I think the problem is clumpify.sh won't work with process substitution, regardless of the arrays. Here is a solution similar to ATpoint but using arrays, and interleaving with paste and awk taken shamelessly from this gist:

paste <(zcat ${r1[@]}) <(zcat ${r2[@]}) \
  | paste - - - - \
  | awk -v OFS="\n" -v FS="\t" '{print($1,$3,$5,$7,$2,$4,$6,$8)}' \
  | clumpify.sh -Xmx1g t=1 int=t in=stdin.fastq out1=$R1output out2=$R2output dedupe subs=1
ADD COMMENTlink written 5 months ago by h.mon27k

I've chosen this as the best answer because it doesn't create extra dependencies.

ADD REPLYlink written 5 months ago by dario.garvan460

You can accept more than one if you'd like.

ADD REPLYlink written 5 months ago by jrj.healey13k
2
gravatar for ATpoint
5 months ago by
ATpoint21k
Germany
ATpoint21k wrote:

I am not sure how these arrays of yours look and how your files are named but the basic idea given you had a file that contains the file names to be merged would be to exploit the interleaved option of bbmap. First interleave the files with seqtk and them stdin them to clumpify:

seqtk mergepe \
  <(cat file.txt | awk '{print $1"_1.fastq.gz"}' | xargs cat) \
  <(cat file.txt | awk '{print $1"_2.fastq.gz"}' | xargs cat) | \
  clumpify.sh in1=stdin.fastq interleaved=true out=interleaved_out.fastq (your_additional_options)

Should be possible to modify it to work with your arrays. The advantage of interleaved files is that you can stream them directly into the aligner saving time and disk space especially when working with large files. BWA mem supports interleaved input with the -p option. Would then be something like:

(above command) out=stdout.fastq | bwa mem (opts...) -p /dev/stdin | samtools view -o out.bam
ADD COMMENTlink modified 5 months ago • written 5 months ago by ATpoint21k

It's a good answer, but adds an additional software to the pipeline, increasing the future maintenance.

ADD REPLYlink written 5 months ago by dario.garvan460
0
gravatar for dario.garvan
5 months ago by
dario.garvan460
Australia
dario.garvan460 wrote:

If there's a lot of extra space on the computer's scratch directory, the easiest to read code would be.

cat ${R1[@]} > /scratch/projectID/Patient1_merged_R1.fastq.gz
cat ${R2[@]} > /scratch/projectID/Patient1_merged_R2.fastq.gz
clumpify.sh -Xmx1g t=1 in1=/scratch/projectID/Patient1_merged_R1.fastq.gz in2=/scratch/projectID/Patient1_merged_R2.fastq.gz out1=$R1output out2=$R2output dedupe subs=1
ADD COMMENTlink written 5 months ago by dario.garvan460

You are using just a gig of RAM and one thread for this clumpify job?

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax70k

Yes, because I made a reduced example which has 1000 reads in each file to test out different approaches. I am using 512 GB RAM and multiple threads on actual data.

ADD REPLYlink written 5 months ago by dario.garvan460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1435 users visited in the last hour