Question

How to parallelize fastq-dump command when reading SRA IDs from a .txt file?

2

8.0 years ago

bioinform ▴ 30

How can I parallelize the fastq-dump command when reading SRA IDs from a .txt file?

Here is my working code without parallelization; it downloads a pair of FASTQ files:

    list=$(cat SRAIdFromPythonInput.txt)  # list of SRA record IDs
    for i in $list
    do
        echo "$i"
        ./fastq-dump --split-files "$i" -v
    done

How can I rewrite it with GNU parallel so that it downloads the data for all SRA IDs listed in the .txt file, not just a single pair of FASTQ files? How do I apply the pattern "cat list | parallel "do-something1 {} config-{} ; do-something2 < {}" | process-output" to this code?

paralell gnu shell fastq-dump sra • 5.8k views

ADD COMMENT • link updated 8.0 years ago by ole.tange ★ 4.5k • written 8.0 years ago by bioinform ▴ 30

0

I'm too lazy to check/test: what files would be generated for one given ID?

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 166k

0

Two FASTQ files with the SRA ID in their names.

ADD REPLY • link 8.0 years ago by bioinform ▴ 30

0

What would the names be? ID.fq.gz? ID.fastq? ID_R1.fq? ID_R1.fastq.gz?

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 166k

0

ID.fastq, as a pair; I rename them in the next step. For example:

SRR5656566_1.fastq and SRR5656566_2.fastq

ADD REPLY • link 8.0 years ago by bioinform ▴ 30

score 0 · Answer 1 · 2017-07-26

0

8.0 years ago

Pierre Lindenbaum 166k

Using a Makefile:

# Note: recipe lines must start with a tab character, not spaces.
IDS=$(shell cat SRAIdFromPythonInput.txt)

.PHONY: all

%_2.fastq: %_1.fastq
	touch -c $@

%_1.fastq:
	./fastq-dump --split-files $* -v && touch -c $@

all: $(addsuffix _2.fastq,$(IDS)) $(addsuffix _1.fastq,$(IDS))

Invoke with make and the number of parallel jobs, e.g.:

make -j 16
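
One practical effect of the Makefile approach is that IDs whose FASTQ files already exist are skipped on a re-run. A rough shell equivalent is sketched below. The missing_ids function name is hypothetical, and the ID_1.fastq / ID_2.fastq output names are taken from the comments above and assumed to land in the current directory. As with make, a file that exists may still be incomplete if a download was interrupted.

```shell
# Sketch only: missing_ids is a hypothetical helper, not part of any tool.
# Print every ID from the list file whose pair of FASTQ files is not
# yet complete in the current directory.
missing_ids() {
  local id
  # "|| [ -n "$id" ]" also handles a last line without a trailing newline.
  while read -r id || [ -n "$id" ]; do
    if [ -z "$id" ]; then
      continue            # skip blank lines
    fi
    if [ ! -e "${id}_1.fastq" ] || [ ! -e "${id}_2.fastq" ]; then
      printf '%s\n' "$id"
    fi
  done < "$1"
}

# Usage: missing_ids SRAIdFromPythonInput.txt
```

The printed IDs can then be fed to whatever runs the downloads, so that only unfinished IDs are fetched again.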

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 166k

0

Thank you for your efforts. Could you please write this code following the GNU parallel pattern: cat list | parallel "do-something1 {} config-{} ; do-something2 < {}" | process-output? Why do you use a Makefile? Is there a tutorial, article, or book chapter on using Makefiles in bioinformatics? I have never used a Makefile for NGS data processing. I found one at http://bsmith89.github.io/make-bml/

ADD REPLY • link 8.0 years ago by bioinform ▴ 30

1

Could you please write this code following the GNU parallel pattern

No.

Why do you use a Makefile?

Because it works, it's easy, standard, ubiquitous, universal, etc.

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 166k

0

Thanks, but I still need code examples using GNU parallel.

ADD REPLY • link 8.0 years ago by bioinform ▴ 30

score 0 · Answer 2 · 2017-07-27

0

8.0 years ago

ole.tange ★ 4.5k

It is unclear to me what SRAIdFromPythonInput.txt contains. Can you give a couple of lines as example?

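doit() {
  i="$1"
  echo "$i"
  ./fastq-dump --split-files "$i" -v
}
# Export the function so the shells started by GNU parallel can see it.
export -f doit
# :::: reads the arguments (one per line) from the given file.
parallel doit :::: SRAIdFromPythonInput.txt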

ADD COMMENT • link 8.0 years ago by ole.tange ★ 4.5k

0

It contains a column of SRA IDs:

    SRR5656566
    SRR5656567
    SRR5656518
    SRR5656500

thx

ADD REPLY • link 8.0 years ago by bioinform ▴ 30
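
With the file format confirmed as one run ID per line, the loop from the question maps onto GNU parallel fairly directly. The sketch below prints one command line per ID; GNU parallel can execute such lines as they are, because it treats each input line as a command when no command is given on its own command line. The gen_cmds function name is hypothetical, the ./fastq-dump path is copied from the question, and the job count of 4 in the usage comments is an arbitrary choice. IDs are assumed to be plain alphanumeric tokens that need no shell quoting.

```shell
# Sketch only: gen_cmds is a hypothetical helper, not part of GNU parallel
# or the SRA toolkit.
# Print one fastq-dump command line per ID found in the list file.
gen_cmds() {
  local id
  # read trims leading/trailing whitespace; the "||" part handles a
  # last line without a trailing newline.
  while read -r id || [ -n "$id" ]; do
    if [ -n "$id" ]; then
      printf './fastq-dump --split-files %s -v\n' "$id"
    fi
  done < "$1"
}

# Usage (requires GNU parallel; runs up to 4 downloads at a time):
#   gen_cmds SRAIdFromPythonInput.txt | parallel -j 4
#
# The shorter form asked about in the question, where {} is replaced
# by one input line per job:
#   cat SRAIdFromPythonInput.txt | parallel -j 4 ./fastq-dump --split-files {} -v
```

Printing the commands first makes it easy to inspect them before anything is downloaded.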