Question: GNU Parallel Block Issues
salamayg (Canada) wrote, 4.0 years ago:

Hello,

I am using GNU parallel to speed up my BLAST jobs. I have seen the example outlined in the following post (Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them) and used the command:

 

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -db db.fa -query - > results

 

In the BLAST output, some sequences are missing (about 30 out of 5000). When I run parallel on its own and inspect the blocks it generates, it appears to lose a number of FASTA records each time it starts a new block, as if the block were not being split at the correct place. Does anyone know why this happens? Any help is appreciated.

Thank you.

blast block parallel gnu parallel • 1.6k views
ole.tange (Denmark) wrote, 4.0 years ago:

To see whether parallel is to blame, try using 'cat' instead of blastp:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe cat > results

If you get all the sequences, then GNU Parallel is not losing them.
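One quick way to check whether all the sequences came through is to count the FASTA header lines in both files. The following is only a sketch: the function names `count_records` and `compare_counts` are made up for illustration, and it assumes every record starts with a line beginning with '>'.

```shell
# Count FASTA records, i.e. lines that start with '>'.
# '|| true' keeps the function from failing when grep finds no match.
count_records() {
    grep -c '^>' "$1" || true
}

# Print both counts and return success only if they are equal.
compare_counts() {
    in_n=$(count_records "$1")
    out_n=$(count_records "$2")
    echo "input: $in_n records, output: $out_n records"
    [ "$in_n" -eq "$out_n" ]
}

# Example call with the file names used in this thread:
#   compare_counts 1gb.fasta results
```

Equal counts do not prove the records are intact (a truncated header still starts with '>'), so this is a first check rather than a full comparison.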

 


Yes, when I do that, sequences are missing. Running wc on the original input and on the parallel output, the count drops from 10000 to 9588. The parallel output begins with:

 

>CTTCACTAGCT

>r9

sequence

 

whereas in the original file, it started as:

>r1

sequence

>r2

sequence

etc.

 

I think there are other instances of missing sequences (at each point where a new block is made), but they are hard to find without going through 10000 lines manually. Do you have any ideas for how I could troubleshoot this? Parallel is extremely useful to me, but with this issue I cannot use it.
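Rather than reading the file by eye, the header lines of the two files can be compared mechanically. A minimal sketch, assuming headers are unique and start with '>'; the function name `missing_headers` is hypothetical and not part of any tool:

```shell
# Print headers that occur in the original file ($1) but do not occur,
# verbatim, in the output file ($2). Both lists are sorted first, so the
# order of records in the output does not matter.
missing_headers() {
    orig_h=$(mktemp) && out_h=$(mktemp) || return 1
    grep '^>' "$1" | sort > "$orig_h"
    grep '^>' "$2" | sort > "$out_h"
    # comm -23 keeps only the lines unique to the first (original) list
    comm -23 "$orig_h" "$out_h"
    rm -f "$orig_h" "$out_h"
}

# Example call with the file names used in this thread:
#   missing_headers test.fna results
```

A header that was truncated in the output also shows up here, because the shortened line no longer matches the original one.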

 

edit: I have found another instance where the read numbers skip and the header is altered.

 

>r187.1 |SOURCES={GI=330827700,fw,273802-273903}|ERRORS={27:C,30:C,32:T,62:A,96:A,99:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
CGGGATCGTGCGGGTGGCTCTGTGCATCCTCGTTGGTTTGAGCGGGGGATGAGTCTGCCGTCAGTGCAGTGGGCCAGAGCAACACCCCGCCAAGCAACAAG

>CE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACAGGTCGTGGTTGTGCAGCCCGCCAGCAGCAATTCGAGAGTCATGGGACGCCCCACTATGATGGACGCTCCCACCACGCCAGCATGCAGACCGTGCATCT

>r195.2 |SOURCES={GI=330827700,bw,2584285-2584386}|ERRORS={36:G,40:A,42:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACCCGCAGCCGACCAAATTGCAGCTGCTGGGGAACGGTGCAGATGGCTACCGGTTGGGTTGATGCTTGCGGCTGCTGGCGAAACTGCACCCGAAGTCGGGC

 

Reply written 4.0 years ago by salamayg (modified 4.0 years ago)

If that is true, you have found a bug. Can you make an example available for download? Quoting it here is unfortunately not enough as \n may be quoted wrongly.

 

Reply written 4.0 years ago by ole.tange

Sure. I hope this is an appropriate download: http://ge.tt/6d3f9g62/v/0?c?c

I also tried piping a different fasta file with simpler headers through parallel to cat (to see whether the headers of the original file were the issue), but the same thing happened.

 

Also, the exact command I use is:

cat test.fna | parallel -k --gnu --block 100k --recstart '>' --pipe cat > results

 

Reply written 4.0 years ago by salamayg (modified 4.0 years ago)

The output is exactly the same as the input on 3 of my systems:

$ md5sum test.fna results 
cc27ec20250c65fdbcc0e23fa132eb83  test.fna
cc27ec20250c65fdbcc0e23fa132eb83  results

So what is hitting you is something on your local system. This changes the bug from a simple fix to harder debugging, which should not be done on Biostars.org. Post to bug-parallel@gnu.org and follow "REPORTING BUGS" in 'man parallel'.
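On a system where the checksums differ, `cmp` can show where the two files first diverge, which may be worth including in a bug report. This is only a sketch: the function name `first_difference` is made up, and it relies on the output being in input order (as with `-k` in the command above).

```shell
# Report whether two files are byte-identical; if not, let cmp print
# the position of the first mismatch.
first_difference() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        # Without -s, cmp reports the byte offset and line number
        # of the first mismatch (or EOF on the shorter file).
        cmp "$1" "$2" || true
        return 1
    fi
}

# Example call with the file names used in this thread:
#   first_difference test.fna results
```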

Reply written 4.0 years ago by ole.tange

Okay, thank you very much.

Reply written 4.0 years ago by salamayg

Did you ever find a resolution to this issue? I have experienced the same issue with GNU parallel 20160422.

Reply written 2.6 years ago by danielfortin86