Question: fastq-dump stream to named pipe (fifo) to Trinity
ooddiittyy (United States) wrote, 2.8 years ago:

So, fastq-dump can be run on just an SRA accession number, so that the SRA data is converted to FASTQ on the fly and the SRA file never has to be written to disk.

I'm curious whether it would be possible to use fastq-dump to write to a named pipe (using mkfifo) and feed that into another program, for example Trinity, to run an assembly on the FASTQ file(s) without ever having to write all that data to disk. For large datasets, this could actually save quite a bit of time in aggregate.

Has anyone done something similar? I am going to try and experiment with the technique soon, but I a) don't know much about the mkfifo process to begin with and b) am unsure of how this procedure would work for paired-end data where fastq-dump is splitting the SRA file as it goes. How would one specify which output would go to which pipe?
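For anyone else new to mkfifo, here is a minimal sketch of the basic idea with a single stream. A named pipe looks like a file to both programs, but the data only passes through memory. The printf producer and the wc consumer are stand-ins chosen for illustration (assumptions, not the real tools); in practice they would be fastq-dump -Z and the downstream program.

```shell
#!/usr/bin/env bash
set -eu

# Work in a scratch directory so nothing is left behind in the current one.
workdir=$(mktemp -d)
fifo="$workdir/reads.fq"

# Create the named pipe; it shows up in the filesystem but holds no data on disk.
mkfifo "$fifo"

# Producer, run in the background (stand-in for: fastq-dump -Z ... > "$fifo").
printf '@read1\nACGT\n+\nIIII\n' > "$fifo" &

# Consumer opens the pipe as if it were a regular FASTQ file
# (stand-in for the assembler or aligner).
n_lines=$(wc -l < "$fifo")

wait
echo "consumer saw $n_lines lines"
```

The producer is started in the background because opening a named pipe for writing blocks until something opens it for reading.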

I would welcome any thoughts from more experienced users!

UPDATE: Okay, so just for others who might stumble upon this, here is a brief description of one implementation of this technique to run with paired-end RNA-seq data:

fastq-dump SRA_file --split-files -I -Z | tee >(grep '@.*\.1\s' -A3 --no-group-separator > namedPipe_1) >(grep '@.*\.2\s' -A3 --no-group-separator > namedPipe_2) >/dev/null

This first requires the creation of two named pipes using mkfifo. For paired-end data, the -Z flag becomes problematic because it forces the data into a single stream. There are many ways to recover the two mates, but the way I've elected to do it is to use --split-files to break up the reads beforehand, -I to append either ".1" or ".2" to the end of each read ID in the header, and then use tee to duplicate the stream plus grep with a regex to pull each mate back out into its own pipe for downstream use.
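The whole setup described above can be tried without an SRA file. The sketch below is self-contained: fake_dump is a hypothetical stand-in that prints a few synthetic records with ".1"/".2" header suffixes (the exact header layout of real fastq-dump output is an assumption here), and two cat processes stand in for the downstream program. It relies on bash process substitution and on GNU grep for \s and --no-group-separator.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Create both named pipes before anything tries to write to them.
mkfifo namedPipe_1 namedPipe_2

# Stand-in consumers: each drains one pipe into a file so the result can be inspected.
cat namedPipe_1 > reads_1.fq &
cat namedPipe_2 > reads_2.fq &

# Hypothetical stand-in for: fastq-dump SRA_file --split-files -I -Z
# Two spots, two mates each, all in one stream.
fake_dump() {
  for spot in 1 2; do
    for mate in 1 2; do
      printf '@SRRFAKE.%s.%s length=4\nACGT\n+SRRFAKE.%s.%s length=4\nIIII\n' \
        "$spot" "$mate" "$spot" "$mate"
    done
  done
}

# tee duplicates the stream; each grep keeps the headers of one mate
# plus the three lines that follow, and writes them to its pipe.
fake_dump | tee >(grep '@.*\.1\s' -A3 --no-group-separator > namedPipe_1) \
                >(grep '@.*\.2\s' -A3 --no-group-separator > namedPipe_2) >/dev/null

# Wait for the consumers to see end-of-file on both pipes.
wait
```

With the real tools, fake_dump would be replaced by the fastq-dump command and the two cat consumers by a single program that is handed namedPipe_1 and namedPipe_2 where it expects the left and right FASTQ files. Both pipes need to be read at roughly the same pace: a consumer that reads one pipe to the end before opening the other can stall the pipeline once the pipe buffers fill.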

I have tested this with Trinity, running on each named pipe just as I would with a FASTQ file, and it seems to be working fine. While I am not 100% sure that Trinity won't try and go back to the original FASTQ files, the first thing Trinity does is take those FASTQ files and parse them into FASTA format, which it later concatenates into "both.fa", and so I'm pretty confident that this will work.

Thanks to everyone who responded! Hope this can be useful for someone else in the future.

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 2.8 years ago:

from the manual: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

 

Workflow and piping:
-O | --outdir <path> Output directory, default is current working directory ('.').
-Z | --stdout Output to stdout, all split data become joined into single stream.
    --gzip Compress output using gzip.
    --bzip2 Compress output using bzip2.

I don't know what the output looks like (interleaved FASTQ?), but it should be possible to do something like:

fastq-dump (options) -Z | awk? | bwa mem -p REF.fa -

 

fastq-dump --split-spot -Z produces 8-line FASTQ output, which can be piped to awk or perl to create separate streams.
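As a rough sketch of the awk route, assuming each spot really does arrive as 8 consecutive lines (4 for the first mate, then 4 for the second), the stream can be split purely by line position. fake_split_spot and its record contents are made up for illustration and only mimic that layout; the output names mate_1.fq and mate_2.fq are likewise placeholders.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for: fastq-dump --split-spot -Z ...
# Each spot is 8 lines: mate 1 (4 lines) followed by mate 2 (4 lines).
fake_split_spot() {
  for spot in 1 2; do
    printf '@SPOT.%s\nACGT\n+SPOT.%s\nIIII\n' "$spot" "$spot"
    printf '@SPOT.%s\nTTGG\n+SPOT.%s\nJJJJ\n' "$spot" "$spot"
  done
}

# Lines 1-4 of every 8-line block go to out1, lines 5-8 go to out2.
fake_split_spot | awk -v out1=mate_1.fq -v out2=mate_2.fq '
{
  if ((NR - 1) % 8 < 4) { print > out1 } else { print > out2 }
}'
```

out1 and out2 could just as well be the paths of two named pipes created with mkfifo, as long as something is reading from both of them at the same time.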

Also, new Java and Python APIs are available from GitHub at ncbi/ngs: https://github.com/ncbi/ngs

The examples available for Java and Python could be extended to suit your needs:

https://github.com/ncbi/ngs/blob/master/ngs-java/examples/examples/FragTest.java 

written 2.8 years ago by osullivanchristopher

Thanks for the response! I am not sure what advantage --split-spot has over --split-files, but you've outlined the general strategy I've settled on. I've been experimenting with some simple awk regex parsing of the merged --split-files -I -Z output, where the -I flag should let me separate the reads back out, since ".1" or ".2" is appended to each header depending on which read it came from.

written 2.8 years ago by ooddiittyy

Thanks for this. It got me on (what I think will be) the right track.

written 2.8 years ago by ooddiittyy